RATES OF CONVERGENCE FOR THE POSTERIOR
DISTRIBUTIONS OF MIXTURES OF BETAS AND ADAPTIVE 
NONPARAMETRIC ESTIMATION OF THE DENSITY 

By Judith Rousseau

Université Paris Dauphine and CREST

In this paper, we investigate the asymptotic properties of non- 
parametric Bayesian mixtures of Betas for estimating a smooth den- 
sity on [0, 1]. We consider a parametrization of Beta distributions in 
terms of mean and scale parameters and construct a mixture of these 
Betas in the mean parameter, while putting a prior on this scaling 
parameter. We prove that such Bayesian nonparametric models have 
good frequentist asymptotic properties. We determine the posterior 
rate of concentration around the true density and prove that it is the 
minimax rate of concentration when the true density belongs to a 
Hölder class with regularity $\beta$, for all positive $\beta$, leading to a
minimax adaptive estimating procedure of the density. We also believe
that the approximating results obtained on these mixtures of Beta 
densities can be of interest in a frequentist framework. 

1. Introduction. In this paper, we study the asymptotic behaviour of 
posterior components. There is a vast literature on mixture models because 
of their rich structure which allows for different uses; for instance, they are 
well known to be adapted to the modelling of heterogeneous populations as 
is used, for example, in cluster analysis (for a good review on mixture models 
see [10] or [11] for various aspects of Bayesian mixture models). They are 
also useful in nonparametric density estimation, in particular, they can be 
considered to capture small variations around a specific parametric model, 
as typically occurs in robust estimation or in a goodness of fit test of a para- 
metric family or of a specific distribution (see, e.g., [12, 13]). The approach 
considered here is density estimation, but it has applications in many other 
aspects of mixture models, such as clustering, classification and goodness
of fit testing, since in all of these cases, understanding the behaviour of the
posterior distribution is crucial. Nonparametric prior distributions based on 
mixture models are often considered in practice and Dirichlet mixture priors 
are particularly popular. Dirichlet mixtures have been introduced by [2, 9] 
and have been widely used ever since, but their asymptotic properties are 
not well known apart from a few cases such as Gaussian mixtures, triangular 
mixtures and Bernstein polynomials. The papers [4, 5] and [15] study the 
concentration rate of the posterior distribution under Dirichlet mixtures of 
Gaussian priors, and Ghosal [3] considers the case of Bernstein polynomials,
that is, mixtures of Beta distributions with fixed parameters. The paper
[13] considers mixtures of triangular distributions, with a prior on the mixing
distribution which is not necessarily a Dirichlet process. In all of those cases,
the authors mainly consider the concentration rate of the posterior around
the true density, when the latter satisfies some known regularity conditions, or
when it is a continuous mixture.

Posterior distributions associated with Bernstein polynomials are known
to be suboptimal in terms of minimax rates of convergence when the true
density is Hölder. An improvement is obtained in [8] based on a modification
of Bernstein polynomials leading to the minimax rate of convergence in the
classes of Hölder densities with regularity $\beta$ when $\beta < 1$. In this paper,
we consider another class of mixtures of Beta models which is richer and,
therefore, allows for better asymptotic results.

Beta densities are often represented as 

$$g(x \mid a, b) = \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}, \qquad B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}. \tag{1.1}$$

Here we consider a different parametrization of the Beta distribution, writing
$a = \alpha/(1-\varepsilon)$ and $b = \alpha/\varepsilon$, so that $\varepsilon \in (0,1)$ is the mean of the Beta
distribution [indeed $a/(a+b) = \varepsilon$], and $\alpha > 0$ is a scale parameter. To
approximate smooth densities on [0, 1], we consider a location mixture of Beta
densities of the form

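$$g_{\alpha,P}(x) = \sum_{j=1}^{k} p_j\, g_{\alpha,\varepsilon_j}(x), \qquad g_{\alpha,\varepsilon_j}(x) = g\bigl(x \mid \alpha/(1-\varepsilon_j),\, \alpha/\varepsilon_j\bigr), \tag{1.2}$$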

where the mixing density is given by 

$$P(\varepsilon) = \sum_{j=1}^{k} p_j\, \delta_{\varepsilon_j}(\varepsilon). \tag{1.3}$$
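As an illustration of the mean and scale parametrization and of the discrete mixture (1.2), here is a minimal Python sketch. The function names, the midpoint-rule integrator and the numerical values of the scale, weights and locations are hypothetical choices made for this example only; they do not come from the paper.

```python
import math


def beta_pdf(x, alpha, eps):
    """Density at x of Beta(alpha / (1 - eps), alpha / eps); its mean is eps."""
    a = alpha / (1.0 - eps)
    b = alpha / eps
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))


def mixture_pdf(x, alpha, weights, locations):
    """Discrete location mixture (1.2): sum_j p_j * g_{alpha, eps_j}(x)."""
    return sum(p * beta_pdf(x, alpha, e) for p, e in zip(weights, locations))


def midpoint_integral(func, n=20000):
    """Midpoint-rule approximation of the integral of func over (0, 1)."""
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) for i in range(n))


# Illustrative (hypothetical) parameter values.
alpha = 20.0
weights = [0.2, 0.5, 0.3]
locations = [0.2, 0.5, 0.8]

total_mass = midpoint_integral(lambda x: mixture_pdf(x, alpha, weights, locations))
mixture_mean = midpoint_integral(lambda x: x * mixture_pdf(x, alpha, weights, locations))
expected_mean = sum(p * e for p, e in zip(weights, locations))
print(total_mass, mixture_mean, expected_mean)
```

Since each component has mean $\varepsilon_j$, the mean of the mixture should agree with $\sum_j p_j \varepsilon_j$ up to quadrature error, which is what the last line lets one check.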

The parameters of this mixture model are then $k \in \mathbb{N}^*$ and, for each $k$,
$(\alpha, p_1, \ldots, p_k, \varepsilon_1, \ldots, \varepsilon_k)$. The prior probability on the set of densities can,
therefore, be expressed as

$$d\pi(f) = p(k)\, \pi_k(\varepsilon_1, \ldots, \varepsilon_k, p_1, \ldots, p_k \mid \alpha)\, d\pi_{k,\alpha}(\alpha), \quad \text{if } f = g_{\alpha,P},$$

or $d\pi(f) = d\pi(P \mid \alpha)\, d\pi_2(\alpha)$, in the case of a Dirichlet mixture.

Determining the concentration rate of the posterior distribution around
the true density corresponds to determining a sequence $\tau_n$ converging to 0
such that if

$$B_{\tau_n} = \{f \in \mathcal{F},\ d(f, f_0) < \tau_n\} \tag{1.4}$$

for some distance or pseudo-distance $d(\cdot,\cdot)$ on the set of densities, and
if $X^n = (X_1, \ldots, X_n)$ where the $X_i$'s are independent and identically
distributed from a distribution having a density $f_0$ with respect to Lebesgue
measure, then

$$P^\pi[B_{\tau_n} \mid X^n] \to 1, \quad \text{in probability}. \tag{1.5}$$

The difficulty with mixture models comes from the fact that it is often 
quite hard to obtain precise approximating properties for these models. The 
papers [14, 18] give general descriptions of the Kullback-Leibler support of 
priors based on mixture models. These results are key in obtaining the con- 
sistency of the posterior distribution, but cannot be applied to obtain rates 
of concentration. In these papers, the authors use the kernel structure of mixture
models. Among such mixture models, location-scale kernels are widely
considered, but mixtures of Betas are not location-scale kernels. However, when $\alpha$
gets large, $g_{\alpha,\varepsilon}$ concentrates around $\varepsilon$, so that locally, these Beta densities
behave like Gaussian densities. This behaviour is described in Section 3. Using
these ideas, we study the approximation of a density $f$ by a continuous
mixture of the form

$$g_{\alpha,f}(x) = \int_0^1 f(\varepsilon)\, g_{\alpha,\varepsilon}(x)\, d\varepsilon, \tag{1.6}$$

where $f$ is a probability density on (0,1). When $\alpha$ becomes large, $g_{\alpha,\varepsilon}(x)$
behaves locally like a location-scale kernel so that $g_{\alpha,f}$ becomes close to
$f$. Similarly to the Gaussian case, this approximation is good only if $f$ has
a regularity less than 2. However, by shifting slightly the mixing density,
it is possible to improve the approximation so that continuous mixtures of
Betas are good approximations of any smooth density (see Section 3.1). As
in the case of Gaussian mixtures (see [4, 15]), we approximate the continuous
mixture by a discrete mixture. In [5], the authors derive a rate of
concentration of the posterior distribution around the true density when the
true density is twice continuously differentiable. In particular, they obtain
the minimax rate $n^{-2/5}$, up to a $\log n$ term, under the $L_1$ risk.
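The claim that the continuous mixture (1.6) gets close to $f$ as $\alpha$ grows can be checked numerically. The sketch below is only an illustration under assumptions of our own: the test density $f(\varepsilon) = 6\varepsilon(1-\varepsilon)$, the evaluation point $x = 0.5$, the values of $\alpha$ and the midpoint quadrature are all hypothetical choices, not taken from the paper.

```python
import math


def beta_pdf(x, alpha, eps):
    """Density at x of Beta(alpha / (1 - eps), alpha / eps)."""
    a = alpha / (1.0 - eps)
    b = alpha / eps
    log_density = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                   + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))
    # math.exp underflows silently to 0.0 when eps is far from x.
    return math.exp(log_density)


def f(eps):
    """Hypothetical smooth test density on (0, 1)."""
    return 6.0 * eps * (1.0 - eps)


def continuous_mixture(x, alpha, n=4000):
    """Midpoint-rule approximation of g_{alpha,f}(x) = int_0^1 f(e) g_{alpha,e}(x) de."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) * beta_pdf(x, alpha, (i + 0.5) * h)
                   for i in range(n))


x = 0.5
errors = {alpha: abs(continuous_mixture(x, alpha) - f(x))
          for alpha in (25.0, 100.0, 400.0)}
for alpha, err in errors.items():
    print(f"alpha = {alpha:6.0f}   |g_alpha,f(x) - f(x)| = {err:.5f}")
```

The printed errors should shrink as $\alpha$ increases, in line with the local Gaussian behaviour of $g_{\alpha,\varepsilon}$ described above.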

In this paper, we show that the minimax rate can be obtained (up to a
$\log n$ term) for any $\beta > 0$ by carefully choosing the rate at which $\alpha$ increases
with $n$, and that putting a prior on $\alpha$ leads to an adaptive minimax rate of
concentration of the posterior. This result is of both theoretical and practical
interest.
1.1. Notation. Throughout the paper, $X_1, \ldots, X_n$ are independent and
identically distributed as $P_0$, having density $f_0$ with respect to Lebesgue
measure. We assume that $X_i \in [0,1]$. We consider the following three
distances (or pseudo-distances) on the set of densities on [0, 1]: the $L_1$ distance,
$\|f - g\|_1 = \int_0^1 |f(x) - g(x)|\, dx$; the Kullback-Leibler divergence, $KL(f,g) =
\int_0^1 f(x) \log(f(x)/g(x))\, dx$, for any densities $f, g$ on [0,1]; and, for any $k > 1$,
$V_k(f,g) = \int_0^1 f(x) |\log(f(x)/g(x))|^k\, dx$. We also denote by $\|g\|_\infty$ the
supremum norm of the function $g$.

$\mathcal{H}(L,\beta)$ denotes the class of Hölder functions with regularity
$\beta$: let $r$ be the largest integer smaller than $\beta$, and denote by $f^{(r)}$ the $r$th
derivative of $f$.

$$\mathcal{H}(L,\beta) = \{f : [0,1] \to \mathbb{R};\ |f^{(r)}(x) - f^{(r)}(y)| \le L|x-y|^{\beta-r}\}.$$

We denote by $S_k$ the simplex, $S_k = \{y \in [0,1]^k;\ \sum_{i=1}^k y_i = 1\}$.
We denote by $P^\pi[\cdot \mid X^n]$ the posterior distribution given the observations
$X^n = (X_1, \ldots, X_n)$, and by $E^\pi[\cdot \mid X^n]$ the expectation with respect to this
posterior distribution. Similarly, $E_0^n$ and $P_0^n$ represent the expectation and the
probability with respect to the true density $f_0^{\otimes n}$, and $E_f^n$ and $P_f^n$ the
expectation and probability with respect to the distribution $f^{\otimes n}$.
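The three discrepancies defined above can be approximated by simple quadrature. The following Python sketch is illustrative only: the function names, the midpoint rule and the pair of densities ($f(x) = 2x$ against the uniform density) are hypothetical choices for this example.

```python
import math


def midpoint_integral(func, n=20000):
    """Midpoint-rule approximation of the integral of func over (0, 1)."""
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) for i in range(n))


def l1_distance(f, g):
    return midpoint_integral(lambda x: abs(f(x) - g(x)))


def kl_divergence(f, g):
    return midpoint_integral(lambda x: f(x) * math.log(f(x) / g(x)))


def v_k(f, g, k):
    return midpoint_integral(lambda x: f(x) * abs(math.log(f(x) / g(x))) ** k)


def f(x):
    """Hypothetical density on [0, 1]."""
    return 2.0 * x


def g(x):
    """Uniform density on [0, 1]."""
    return 1.0


l1 = l1_distance(f, g)        # exact value: 1/2
kl = kl_divergence(f, g)      # exact value: log(2) - 1/2
v2 = v_k(f, g, 2)
print(l1, kl, v2)
```

By Jensen's inequality, $V_2(f,g) \ge KL(f,g)^2$, which gives a quick sanity check on the output.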

1.2. Assumptions. Throughout the paper, we assume that the true density
$f_0$ is positive on the open interval (0,1) and satisfies:

Assumption $A_0$. If $f_0 \in \mathcal{H}(\beta, L)$, there exist integers $0 \le k_0, k_1 < \beta$ such
that

$$f_0^{(k_0)}(0) > 0, \qquad f_0^{(k_1)}(1) < 0;$$

$k_0$ and $k_1$ denote the first integers such that the corresponding derivatives
calculated at 0 and 1, respectively, are nonzero.

This assumption is quite mild and ensures that $f_0(x)$ does not go too
quickly to 0 when $x$ goes to 0 or 1, so that we can control the Kullback-Leibler
divergence between $f_0$ and mixtures of Betas.

1.3. Organization of the paper. The paper is organized as follows. In 
Section 2, we give the two main theorems on the concentration rates of 
the posterior distributions under specific types of priors. In Section 3, we 
present some results describing the approximating properties of mixtures 
of Betas. We believe that these results are interesting outside the Bayesian 
framework, since they could also be applied to obtain convergence rates 
for maximum likelihood estimators. This section is divided into two parts. 
First we describe how continuous mixtures can approach smooth densities
(Section 3.1), then we approach continuous mixtures by discrete mixtures
(Section 3.2). Finally, Section 4 is dedicated to the proofs of Theorems 2.1 
and 2.2. 

2. Posterior concentration rates. In this section, we give the two main 
results on the concentration rates of the posterior distribution around the 
true density. We first consider the case of a varying number of components, 
which we call the adaptive prior and then we consider a Dirichlet mixture 
also leading to an adaptive rate of concentration on a more restrictive class 
of densities. In both cases, a diffuse prior on $\alpha$ is considered. Finally, a
nonadaptive rate is obtained by considering a deterministic sequence $\alpha_n$
increasing to infinity. We consider a concentration rate in terms of the $L_1$
distance; however, the results can be applied to the Hellinger distance as
well. We first describe the adaptive prior.

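Adaptive prior: let $f = g_{\alpha,P}$ and $P = \sum_{j=1}^{k} p_j \delta_{\varepsilon_j}$ be the mixing distribution;
then

$$d\pi(f) = p(k)\, d\pi_{k,2}(p_1, \ldots, p_k) \prod_{j=1}^{k} \pi_\varepsilon(\varepsilon_j)\, \pi_\alpha(\alpha)\, d\alpha\, d\varepsilon_1 \cdots d\varepsilon_k.$$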

For all $k > 0$, $d\pi_{k,2}$ has a positive density $\pi_{k,2}$ with respect to Lebesgue
measure on the simplex $S_k$, which is bounded from below by a term of
the form $c^k$. Conditionally on $k$, the $\varepsilon_j$'s, $j = 1, \ldots, k$, are independent and
identically distributed with density $\pi_\varepsilon$ which satisfies

$$a_1 \varepsilon^T (1-\varepsilon)^T \ge \pi_\varepsilon(\varepsilon) \ge a_2 \varepsilon^T (1-\varepsilon)^T \quad \forall \varepsilon \in (0,1),$$

for some $a_1, a_2 > 0$ and $T > 1$. We consider the following conditions on the
prior $\pi_\alpha$: $\pi_\alpha$ is bounded and for all $b_1 > 0$, there exist $c_1, c_2, c_3, A > 0$ such
that for all $u$ large enough,

$$\pi_\alpha(c_1 u < \alpha < c_2 u) \ge C e^{-b_1 u^{1/2}},$$

$$\pi_\alpha(c_3 u < \alpha) \le C e^{-b_1 u^{1/2}},$$

$$\pi_\alpha(\alpha < e^{-Au}) \le C e^{-b_1 u}.$$

Let $L(k)$ be either equal to 1 for all $k$ or $L(k) = \log(k)$. The distribution
on $k$ satisfies the following condition: there exist $a_1, a_2 > 0$ such that for all
$K$ large enough,

$$e^{-a_1 K L(K)} \le p[k = K] \le e^{-a_2 K L(K)}.$$
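To make the structure of the adaptive prior concrete, the following Python sketch draws one set of parameters $(k, p_1, \ldots, p_k, \varepsilon_1, \ldots, \varepsilon_k, \alpha)$. Every specific choice here is an assumption made for illustration, not the paper's specification: $p(k) \propto e^{-a k \log k}$ truncated at a hypothetical `k_max`, weights uniform on the simplex, $\varepsilon_j$ drawn from a Beta$(T+1, T+1)$ density (proportional to $\varepsilon^T(1-\varepsilon)^T$), and $\sqrt{\alpha}$ Gamma-distributed.

```python
import math
import random


def sample_adaptive_prior(rng, a_k=1.0, T=2.0, gamma_shape=2.0, gamma_rate=1.0,
                          k_max=200):
    """Draw (k, weights, eps, alpha) from one hypothetical adaptive prior."""
    # Number of components: p(k) proportional to exp(-a_k * k * log k),
    # i.e. the case L(k) = log(k); truncated at k_max for sampling convenience.
    ks = list(range(1, k_max + 1))
    k_weights = [math.exp(-a_k * k * math.log(k)) for k in ks]
    k = rng.choices(ks, weights=k_weights)[0]

    # Mixture weights: uniform on the simplex via normalized Exp(1) draws.
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    weights = [v / total for v in raw]

    # Locations: i.i.d. with density proportional to eps^T (1 - eps)^T.
    eps = [rng.betavariate(T + 1.0, T + 1.0) for _ in range(k)]

    # Scale: sqrt(alpha) ~ Gamma(shape, rate); gammavariate takes a scale argument.
    sqrt_alpha = rng.gammavariate(gamma_shape, 1.0 / gamma_rate)
    return k, weights, eps, sqrt_alpha ** 2


rng = random.Random(0)
k, weights, eps, alpha = sample_adaptive_prior(rng)
print(k, weights, eps, alpha)
```

The draw can be plugged into a mixture density of the form (1.2) to visualize prior realizations.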

Note that if $\sqrt{\alpha}$ follows a Gamma distribution with parameters $(a, b)$
with $a > 1$, then the conditions on $\pi_\alpha$ are satisfied. We have the following
theorem:






Theorem 2.1. Consider an adaptive prior, as described above; then
the posterior distribution satisfies, for all $\beta > 0$ and $f_0 \in \mathcal{H}(\beta, L)$ satisfying
Assumption $A_0$,

$$P^\pi[B^c_{\tau_n} \mid X^n] = o_P(1)$$

with

$$\tau_n = \tau_0\, n^{-\beta/(2\beta+1)} (\log n)^{5\beta/(4\beta+2)}, \quad \text{if } L(k) = \log(k),$$
Tn = To n-VW +l \\ognfV^ + V +l l\ if L(jfc) = 1. 

The prior does not depend on $\beta$ so that the procedure is adaptive and
optimal up to a $\log n$ term, since for each $\beta > 0$ the rate $n^{-\beta/(2\beta+1)}$ is the
minimax rate of convergence in the class $\mathcal{H}(\beta, L)$.
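To get a feel for these rates, the short Python sketch below evaluates the minimax rate $n^{-\beta/(2\beta+1)}$ and the same rate inflated by a logarithmic factor. The exponent $5\beta/(4\beta+2)$ on $\log n$ is the one read here for the case $L(k) = \log(k)$ and should be treated as an assumption of this example, as are the function names and the default $\tau_0 = 1$.

```python
import math


def minimax_rate(n, beta):
    """Minimax rate n^(-beta / (2 beta + 1)) over a Hölder class of regularity beta."""
    return n ** (-beta / (2.0 * beta + 1.0))


def posterior_rate(n, beta, tau0=1.0):
    """Minimax rate times (log n)^(5 beta / (4 beta + 2)); tau0 is a hypothetical constant."""
    log_power = 5.0 * beta / (4.0 * beta + 2.0)
    return tau0 * minimax_rate(n, beta) * math.log(n) ** log_power


for beta in (0.5, 1.0, 2.0, 4.0):
    for n in (10 ** 3, 10 ** 5):
        print(f"beta = {beta:3.1f}  n = {n:6d}  "
              f"minimax = {minimax_rate(n, beta):.4f}  "
              f"with log factor = {posterior_rate(n, beta):.4f}")
```

The table makes visible how the rate improves with the regularity $\beta$ and how the logarithmic factor matters at moderate sample sizes.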

Dirichlet mixtures form an alternative to the above prior, which is often 
considered in practice, since they lead to efficient algorithms and have in- 
teresting properties for classification models, for instance. We now present 
the asymptotic concentration rate of the posterior based on the following 
Dirichlet mixtures of Beta densities. 

Dirichlet prior: the mixing distribution $P$ follows a Dirichlet process $\mathcal{D}(\nu)$
associated with a finite measure whose density with respect to Lebesgue
measure is denoted $\nu$ and is positive on the open interval (0, 1). Assume also
that $\nu$ is bounded and satisfies

The prior on $\alpha$, $\pi_\alpha$, has support $[n^t, +\infty)$, for some $0 < t < 1$, and satisfies: for
all $b_1 > 0$, there exist $c_1, c_2, c_3, C > 0$ such that for all $\alpha_n$ satisfying
$\alpha_n n^{-t} \to +\infty$,

$$\pi_\alpha(c_1 \alpha_n < \alpha < c_2 \alpha_n) \ge C e^{-b_1 \sqrt{\alpha_n}},$$
$$\pi_\alpha(c_3 \alpha_n < \alpha) \le C e^{-b_1 \sqrt{\alpha_n}}.$$

Note that if $\sqrt{\alpha} = n^t + \Gamma(a,b)$, with $a, b > 0$, then the above condition is
satisfied.

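Theorem 2.2. Consider a Dirichlet prior; then the posterior distribution
satisfies: for all $f_0 \in \mathcal{H}(\beta, L)$ with $\beta > 0$, and satisfying Assumption $A_0$,

$$P^\pi[B^c_{\tau_n} \mid X^n] = o_P(1)$$

with

$$\tau_n = \tau_0\, n^{-\beta/(2\beta+1)} (\log n)^{5\beta/(2(2\beta+1))}, \quad \text{if } \beta < 1/t - 1/2,$$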

Tn = ron" 1 / 2 ^ 4 (log n) W+VV/W+D j if p>l/ t - 1/2. 
Hence the Dirichlet prior implies a minimax adaptive rate of concentration
on the densities with regularity $\beta < 1/t - 1/2$. By choosing $t$ small, this
class of functions is quite large, with small loss in the rates of convergence.

We could have considered $\alpha = \alpha_n$ deterministic and increasing with $n$,
which would have implied the following nonadaptive posterior rate, depending
on $\alpha_n$.

Corollary 2.1. Consider a prior belonging either to the class of adaptive
priors or to the class of Dirichlet priors, as described above, apart from
the fact that $\alpha = \alpha_n = o(n)$ is deterministic. Then if $f_0 \in \mathcal{H}(\beta, L)$ and
satisfies Assumption $A_0$,

$$P^\pi[B^c_{\tau_n} \mid X^n] = o_P(1)$$

with T n = T (loga n )[an l3/2 V (y/a n log a n /n) 1/2 ]. 

In particular, if a n = n 2 ^ 2 ^ +1 )(logn)~ 3 ^ 2 ^ +1 ), we obtain the minimax 
rate (up to a logn) term r n = ron -/3 /( 2/3+1 )(logn) 5/3// ( 4/3+2 ). Note that deter- 
ministic sequences a n lead to nonadaptive concentration rates. 

These results imply that for any $\beta > 0$, the optimal rate, in the minimax
sense, is obtained. Hence the above mixtures of Betas form a richer class of
models than the Bernstein polynomials or the mixtures of triangular
distributions, which lead, at best, to the minimax rates for $\beta \le 2$. It is to be noted,
however, that Bernstein polynomials and mixtures of triangular densities
have other interesting properties and are particularly easy to simulate.

Corollary 2.1 sheds light on the impact of the scale parameter $\alpha$. It can
thus be compared to the scale parameter which appears in Dirichlet mixtures
of Gaussian distributions. Interestingly, van der Vaart and van Zanten
[16, 17] also study the impact of scaling factors in nonparametric priors
constructed as scaled Gaussian processes, and as in our case, considering a
random scaling factor allows for adaptive, minimax concentration rates.

In Section 3, we see that the key factor leading to such a rate is the
possibility of approximating any $f_0 \in \mathcal{H}(L, \beta)$ by a continuous mixture of the
form $g_{\alpha_n,f}$ with an error of order $\alpha_n^{-\beta/2}$ in supremum norm [see (3.1)], for
some density $f$ close to $f_0$ but not necessarily equal to $f_0$. An interesting
feature leading to this approximating property is that $g_{\alpha_n,\varepsilon}$ acts locally as
a Gaussian kernel around $\varepsilon$. However,
the interest in the Bayesian procedure, compared to a classical frequentist
kernel nonparametric method, comes from the fact that we do not necessarily
need to approach $f_0$ by $g_{\alpha_n,f_0}$, which would have constrained us to $\beta \le 2$.
Indeed, if necessary, we can consider a slight modification $f$ of $f_0$ such that
$g_{\alpha_n,f}$ approximates $f_0$ with an error of this order for all $\beta$. This is described
in the following section.
3. Approximation of a smooth density by continuous and discrete mixtures.
A Beta mixture, as defined by (1.6), behaves locally like a Gaussian
mixture; however, its behaviour seems to be richer since the variance adapts
to the value of $x$ (see Lemma 3.1). In this section, we obtain a way to
approximate any Hölder density $f$ by a sequence of continuous and discrete
mixtures. We begin with approximating the density by a sequence of continuous
mixtures, and then we approximate the continuous mixtures by discrete
mixtures.



3.1. Continuous mixtures. We consider a continuous mixture $g_{\alpha,f}$ as
defined in (1.6). This mixture is based on the parametrization of a Beta density
in terms of mean $\varepsilon$ and scale $\alpha$. The idea in this section is that when $\alpha$
becomes large, the above mixture converges to $f$, if $f$ is continuous. We first
give a result where the approximation is controlled in terms of the supremum
norm, which has an intrinsic interest. We also give a bound on the
approximation error for Kullback-Leibler types of divergence, which is the
required result to control the posterior concentration rate.



Theorem 3.1. Assume that $f_0 \in \mathcal{H}(\beta, L)$ and satisfies Assumption $A_0$,
with $\beta > 0$. Then there exists a probability density $f_1$ such that



/i(x) = / (x) 1 + 



E 

3=2 



Wj(x) 



if 13 > 2; 



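$$f_1(x) = f_0(x), \quad \text{if } \beta \le 2,$$

where the $w_j$'s are combinations of polynomial functions of $x$ and of terms
of the form $f_0^{(l)}(x)\, x^l (1-x)^l / f_0(x)$, $l \le j$, and

$$\|g_{\alpha,f_1} - f_0\|_\infty \le C \alpha^{-\beta/2}, \tag{3.1}$$

and for all $p > 0$,

$$KL(f_0, g_{\alpha,f_1}) \le C \alpha^{-\beta}, \qquad \int_0^1 f_0(x) \left| \log\!\left( \frac{f_0(x)}{g_{\alpha,f_1}(x)} \right) \right|^p dx \le C \alpha^{-\beta}. \tag{3.2}$$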

Note that if we do not allow $f_1$ to be different from $f_0$, we do not achieve
the rate $\alpha^{-\beta/2}$ for values of $\beta$ greater than 2. We believe that the
trick of allowing $f_1$ to be different from $f_0$ could be used in a more general
context of Bayesian mixture distributions (or Bayesian kernel approaches
as defined in [18]), inducing a greater flexibility of Bayesian kernel methods
with respect to frequentist kernel methods.

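A Beta density with parameters $(\alpha/(1-\varepsilon), \alpha/\varepsilon)$ can be expressed as

$$g_{\alpha,\varepsilon}(x) = \frac{\Gamma(\alpha/(\varepsilon(1-\varepsilon)))}{\Gamma(\alpha/\varepsilon)\, \Gamma(\alpha/(1-\varepsilon))}\, x^{\alpha/(1-\varepsilon)-1} (1-x)^{\alpha/\varepsilon-1}.$$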
From this, we have the following three approximations that will be used
throughout the proofs of Theorems 2.1, 2.2, 3.1 and 3.2. Let

$$K(\varepsilon, x) = \varepsilon \log(\varepsilon/x) + (1-\varepsilon) \log((1-\varepsilon)/(1-x)); \tag{3.3}$$

this is the Kullback-Leibler divergence between the Bernoulli $\varepsilon$ and the
Bernoulli $x$ distributions. Then:



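Lemma 3.1.

$$g_{\alpha,\varepsilon}(x) = \frac{\sqrt{\alpha}}{\sqrt{2\pi}\, x(1-x)}\, e^{-\alpha K(\varepsilon,x)/(\varepsilon(1-\varepsilon))} \left[ 1 + \sum_{j=1}^{k} \frac{b_j(\varepsilon)}{\alpha^j} + O(\alpha^{-(k+1)}) \right] \tag{3.4}$$

for any $k > 0$ and $\alpha$ large enough, where the $b_j(\varepsilon)$ are polynomial functions.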
For all k > 0, k\ > 3, we also have, 



9a,e( X ) 



27Tx(l — x) 

a(x — e) 2 



x exp 



(3.5) 



2x 2 (l -x) 2 
(x-e) 



1 + 



x(l — x) 



C(x) + Q kl 



X — £ 

x(l — x) 



l + V^ + 0(a"( fc+1 )) 

3=1 

where R\ < aC\x - e\ kl ~ 2 (x e (l - x £ ))~ kl+2 , 

fei-3 



/ x-e \ _ ^ Q(x) 
Qk \x{l-x)) (x(l 



Ci{x)( x-e) 
x)Y 



and the functions C(x),Ci(x), I < k\, are polynomial where x £ € (x,e) and 
C is a positive constant. Moreover when a\x — e| 3 < Cox 3 (l — x) 3 for any 
positive constant Co, if &2 > 0, and if k\ > 3 V 3&2, there exists C\ > such 
that 



9a,e( x ) 



(3.6) 



27Tx(l — X) 
Z-j|( x (l_ x ))3i 



C(x) + Q fcl 



x — e 
x(l — x) 



+ R 
1 + E M£>+o (a



-(fc+l)> 



w/iere |J?| < Cia fc2+1 |:r - e| 3 ( fc2+1 )(x e (l - x e ))- 3 ( fc2+1 ) . 

Note that the term $O(\alpha^{-(k+1)})$ appearing in (3.4), (3.5) and (3.6) is
uniform in $x$ and $\varepsilon$.
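The leading term of (3.4) can be compared numerically with the exact Beta density. The sketch below assumes that the leading term reads $\sqrt{\alpha}\, e^{-\alpha K(\varepsilon,x)/(\varepsilon(1-\varepsilon))} / (\sqrt{2\pi}\, x(1-x))$, which is what Stirling's formula gives for this parametrization; the function names and the point $(x, \varepsilon) = (0.4, 0.45)$ are hypothetical choices for illustration.

```python
import math


def beta_pdf(x, alpha, eps):
    """Exact density at x of Beta(alpha / (1 - eps), alpha / eps)."""
    a = alpha / (1.0 - eps)
    b = alpha / eps
    log_density = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                   + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))
    return math.exp(log_density)


def bernoulli_kl(eps, x):
    """K(eps, x): Kullback-Leibler divergence between Bernoulli(eps) and Bernoulli(x)."""
    return eps * math.log(eps / x) + (1.0 - eps) * math.log((1.0 - eps) / (1.0 - x))


def leading_term(x, alpha, eps):
    """Leading term of the expansion (3.4), without the 1 + sum_j b_j / alpha^j factor."""
    return (math.sqrt(alpha / (2.0 * math.pi)) / (x * (1.0 - x))
            * math.exp(-alpha * bernoulli_kl(eps, x) / (eps * (1.0 - eps))))


x, eps = 0.4, 0.45
ratios = {alpha: beta_pdf(x, alpha, eps) / leading_term(x, alpha, eps)
          for alpha in (10.0, 100.0, 1000.0)}
for alpha, ratio in ratios.items():
    print(f"alpha = {alpha:6.0f}   exact / leading term = {ratio:.6f}")
```

The ratio should approach 1 as $\alpha$ grows, the gap being the $b_1(\varepsilon)/\alpha$ correction and higher-order terms.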



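Proof of Lemma 3.1. The proof of (3.4) follows from the expression
of the Beta densities in the form

$$g_{\alpha,\varepsilon}(x) = \frac{\Gamma(\alpha/(\varepsilon(1-\varepsilon)))\, \varepsilon^{\alpha/(1-\varepsilon)} (1-\varepsilon)^{\alpha/\varepsilon}\, e^{-\alpha K(\varepsilon,x)/(\varepsilon(1-\varepsilon))}}{\Gamma(\alpha/\varepsilon)\, \Gamma(\alpha/(1-\varepsilon))\, x(1-x)}$$

and from a Taylor expansion of $\Gamma(y)$ for $y$ close to infinity, where we obtain
that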



r(a/( e (l-e))) 
T{a/e)T{a/(l-e)) 

expl —a 
2ir V 



log(e) log(l - e) 
1 — e e 



)( 1+ f> 



^'(l-eF 



3=1 



a? 



where the $b_j$'s are the coefficients appearing in the expansion of the Gamma
function near infinity (see, e.g., [1]). Putting the three remaining terms
together results in: for all $k > 0$,



e 3 \l-e) 3 



-i 



1 + J> 

j"=i 



1 + E^r + °(«" (fc+1) )' 



where the bj(e)'s are polynomial functions with degree less than 2j. This 
implies (3.4). To obtain (3.5) we make a Taylor expansion of (3.4) as a 
function of e around x. 

K{£ ' x) - {£ ~ x? +Y.c 3 {x) t~ £) l+Ri, 



e(l-e) 2x 2 (l-x) 2 



3=3 



x 3 (1 — x) 3 
where $R_1 \le R|x - \varepsilon|^{k_1+1} / (x_\varepsilon(1 - x_\varepsilon))^{k_1+1}$ for some $x_\varepsilon \in (x, \varepsilon)$, leading to
(3.5). A Taylor expansion of $e^y$ around 0 combined with the above
approximation of $y$ leads to (3.6). □

To prove (3.1), we control the difference between the uniform density on
[0, 1] and the corresponding Beta mixture $g_\alpha = \int_0^1 g_{\alpha,\varepsilon}\, d\varepsilon$. This is given in
the following lemma.

Lemma 3.2. For all a > large enough, for all &2 > 1 and k\ > 3(&2 — 1) 
define 

'M=E^+E^=w<W' 



1/2 



then 



1 



a 



<Ca-( fc2+1 )/ 2 (loga) 3 ( fc2+1 



)/2 



where the Bi{x) 's are polynomial functions of x. 

The proof of Lemma 3.2 is given in Appendix A. We now prove Theorem 
3.1. 



Proof of Theorem 3.1. Throughout the proof, $C$ denotes a generic
positive constant. Let $f \in \mathcal{H}(\beta, L)$ and denote $r = \lfloor \beta \rfloor$. Then $\forall \varepsilon \in (0, 1)$,



(3.7) 



3=0 



< Liz — el 



The construction of $f_1$ is iterative. Let $\delta_x = \delta_0 x(1-x)\sqrt{\log \alpha/\alpha}$. We bound



|z-erg a ,e(aj)«fe< 



3o, i£ (z)d£ + 



x+8 x 



g a)£ (x)de 



+ 



X+5 X 



X-& x 



x - £f : g a , £ {x)de. 



Equation (A. 6) implies that for all H > 0, if 5q is large enough, the first 
term of the right-hand side of the above inequality is 0{a~ H ). We treat the 
second term using the same calculations as in the case of Lj in Appendix A, 
so that for all k > 0, 

rx+8 x 

\x - ef g a>e (x) de 
< Ccr^Va - xfE[\AT{0, l)f] + 0{a~ k/2 ). 
f \x-efg ate {x)de = 0{a- p/2 x p {l-xf) + 0{a- H ) VH > 0,
J o 



uniformly in x. Then for all H > 0, 

•? ! «/ 



(e - x) 3 ga, £ (x) <fe + f(x)(g a (x) - 1) 



J- Jo 



1 ^"0*0 

(e - x) 3 g at e(x) de + /(sc) 

a 



+ 0(a- /3 /V(l-x /3 ) + a-"), 

uniformly in x, for all H > 0. Using the same calculations as in the compu- 
tation of I3 in the proof Lemma 3.2, we obtain for all j > 1, to the order 
0( a -(fc+J+i)/V(l - x) 3 + a- ff ) 



(e - x) j g a:e (x)ds 



x+S x 



-a(x-e) 2 /(2x 2 (l-x) 2 ) 



2irx(l - x) Jx~5 x 

J, x 



C{x) + Q fcl 
Di{x)x 3 (1 - x) 3 



X — £ 

x(l — x) 



de 



i=i 



a 



(j'+0/2 



so that we can write, 



/ (e- x) 3 g a , e (x) de 
Jo 



x 3 (1 - x) 3 , , . 
— — ^ Mi,a(aj) + 0(q 



x j {l-x) j + a~ H ), 



where ^j )0l {x) is a polynomial function of x with the leading term being 
equal to fij. We can thus write, to the order 0{a~^/ 2 x^ (1 — x@) + a~ H ) 

(3.8) [9aJ - m = ± f U) ^(^-f^) + f{x , m 



3=1 



j\ a j/2 



a 
Hence if $\beta \le 2$, since $\mu_1 = 0$,

\9a,f ~ f\(x) < l|/|lo ° /(x) + 0(a-^(l - J)) + 0(a~ H ) 
(3.9) a 

as soon as H > (3/2, leading to (3.1) with f± = f. If j3 > 2, we construct a 
probability density f\ satisfying 

(g aJl - f)(x) = 0(a^V(l - xf) + 0(a~ H ). 

Equation (3.8) implies that /i needs satisfy, to the order 0(a~ H ), 

f['\x)xi {l-x)^ ha {x) / I(x) 



y j, x-x r ^, aW + 1 + 

= /(x)+0(a- /3/2 rr /3 (l-x /3 )). 



To prove that such a probability density exists we construct it iteratively. 
Let 2 < (3 < 3, then set 

/ I(x)\ x(l-x)f(x)C(x)^ x 2 (l-x) 2 f'(x)^ 
hi(x) = f{x) 1 



a J a 2a 

Note that if / G %(L,/3), then inf / > implies h± > for a large enough 
and if /(0) = [/(l) = 0] when x is close to (resp. 1), if 

hminf / ^ ) >0, j = 1,2, 

h\>0 for a large enough on [0, 1]. Assumption Ao implies the above relation 
between / and since 

x fco /( fc °)(xi) / J(x)\ x ko (l-x)f ( - ko \x 2 )C(x)i M 

h n x ) = n 1 



Aft! V a J a(ko — 1)! 

(3.10) 

_ x ko {l-x) 2 f^(x 3 )fi 2 
2a(k -2)! 

with xi,X2,X3 G (0, x). Since f( k °\0) > 0, /ii(x) is equivalent to /(x) for 
a large enough and x close to zero, and h\{x) > for all x G (0,1). Let 
ci = Jo hi(x) dx. Since J" - f](x) dx = 0, 



c 1 = 1 + 0(q 



-(3/2A/3/2)n 



and we can divide /ii by its normalizing constant and obtain the same result 
as before, so that h\ can be chosen to be a probability density on [0, 1]. 
From this we obtain when $\beta > 2$,



0/«,fci - /)(*) = ^ (E - <M*) de + h^x) 1 -^ 



J=r-1 

+ 0(a-^V(l-^) 

+ 0(a~ 2A ^ V(l - x)^ 3 ) + 0(a" H ) 

or 



Vfl" >0, 



where w{x) is a combination of polynomial functions of x and of functions 
in the form x\l — x)i f^\x), with j < 3, if ft < 4. H ft < 4, then we set 
/i = /ii (renormalized), else we reiterate. We thus obtain that if rg is the 
largest integer (strictly) smaller than ft/2, 

M 



aw=/w(i+E^). 



where Wj(x) is a combination of polynomial functions and of terms in the 
form f( l \x)x l (l — x) l /f(x), I < 2j. Assumption Ao implies that /i can be 
chosen to be a density when a is large enough and satisfies 

||<7a,/i " /Hoc < CoT^ 2 , 

which implies (3.1). 

If / is strictly positive on [0, 1], then (3.2) follows directly from (3.1). We 
now consider the case where /(0) = [the case /(l) = is treated similarly]. 
Under the Assumption Ao, the previous calculations lead to 

{g a ,h ~ /)(*) = 0(f(x)a-V 2 ) + 0(a~ H ) VH > 0. 

Note also that for a large enough, f\ is increasing between and 5 for some 
positive constant a > 0, so that if x is small enough, 

> f0W* r +s * e - a{x - e?/{2x * {1 - x?)de 

y Jl - 2y&(l - x) J x 

(3.11) 

~ 4 ' 

so that g a j 1 > //8 on [0, 1]. Therefore, since o(x °) 

when x is close to 0, let H > ft and c = coa~ H / k °; for some constant cq large 
enough, we have

KL(/, g aJl ) < log 2 J* f(x) dx + J 



+ 



/(*) 



log 1 



-H 



f(x)dx 



dx 



<C(a 



-H(k +l)/k 



+ a~P + a- H ) = 0(a-P). 



-H/Cpfco) 



Similarly, for all p > 0, if c p = coft 

f(x)\log(f{x)/g aJl (x))\ p dx 

< {\og2f Jj f{x)dx + a~ p P 

-H ' 



+ 



Six) 



log 1 



a 



dx 



H > 



<C(ft- 2 ^ fco+1 )/^°) + ft-^ + ft- 
= 0(ft^), 

if $H > p\beta$. This completes the proof of Theorem 3.1. □



In the following section, we consider the approximation of continuous 
mixtures by discrete mixtures in a way similar to [4]. 

3.2. Discrete mixtures. Let $P$ be a probability on [0, 1] with cumulative
distribution function denoted by $P(x)$ for all $x \in [0, 1]$. We consider a mixture
of Betas similar to before but with a general probability distribution $P$ on
[0,1],

$$g_{\alpha,P}(x) = \int_0^1 g_{\alpha,\varepsilon}(x)\, dP(\varepsilon).$$

Let $f$ be a probability density with respect to Lebesgue measure on [0,1].
In this section, we study the approximation of $g_{\alpha,f}$ by $g_{\alpha,P}$ where $P$ is a
discrete measure with finite support.

The approximation of continuous mixtures by discrete ones is studied
in different contexts of location-scale mixtures. See, for instance, [4] or [7],
Chapter 3, for a general result. Beta mixtures are not location-scale mixtures;
however, as discussed in the previous section, when $\alpha$ is large they behave
locally like location-scale mixtures. In this section, we use this property to
approximate continuous mixtures with finite mixtures having a reasonably
small number of points in their support.
Theorem 3.2. Let $f$ be a probability density on [0,1], with $f(x) > 0$ for all
$0 < x < 1$, and such that there exist $k_1, k_0 \in \mathbb{N}$ satisfying $f(x) \sim x^{k_0} c_0$, if
$x = o(1)$ and $f(1-x) \sim (1-x)^{k_1} c_1$, if $1 - x = o(1)$. Then there exists a
discrete probability distribution $P$ having at most $N = N_0 \sqrt{\alpha} (\log \alpha)^{3/2}$ points
in its support, such that for all $p \ge 1$, for all $H > 0$ (depending on $M_0$), for
$\alpha$ large enough,

$$\int_0^1 g_{\alpha,f}(x) \left| \log\!\left( \frac{g_{\alpha,f}(x)}{g_{\alpha,P}(x)} \right) \right|^p dx \le C \alpha^{-H}. \tag{3.12}$$

We can choose the distribution $P$ such that there exists $A > 0$ with $p_j > \alpha^{-A}$
for all $j \le N$.



We use this inequality to obtain the following result on the true density
$f_0$.

Corollary 3.1. Let $f_0 \in \mathcal{H}(L, \beta)$, $\beta > 0$, be a probability density on
[0, 1], satisfying $f_0(x) > 0$ for all $0 < x < 1$, and such that there exist $k_1, k_0 \in
\mathbb{N}$ satisfying $|f_0^{(k_0)}(0)| > 0$ and $|f_0^{(k_1)}(1)| > 0$, $k_0, k_1 < \beta$. Then for all $p \ge
1$, there exists a discrete probability distribution $P$ having at most $N =
N_0 \sqrt{\alpha} (\log \alpha)^{3/2}$ points in its support, with $N_0$ large enough, such that

$$KL(f_0, g_{\alpha,P}) \le C \alpha^{-\beta}, \qquad V_p(f_0, g_{\alpha,P}) \le C \alpha^{-\beta}. \tag{3.13}$$



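Proof. From Theorem 3.1 there exists $f_1$ positive with $f_1 = f_0(1 +
O(\alpha^{-1}))$ and

$$KL(f_0, g_{\alpha,f_1}) \le C \alpha^{-\beta}.$$

This implies that

$$KL(f_0, g_{\alpha,P}) \le KL(f_0, g_{\alpha,f_1}) + \int_0^1 f_0(x) \log(g_{\alpha,f_1}/g_{\alpha,P})(x)\, dx$$

$$\le C \alpha^{-\beta} + 8 \int_0^1 g_{\alpha,f_1}(x) \left| \log\!\left( \frac{g_{\alpha,f_1}}{g_{\alpha,P}} \right)(x) \right| dx = O(\alpha^{-\beta}).$$

The same calculations apply to $\int_0^1 f_0(x) |\log(f_0(x)/g_{\alpha,P}(x))|^p\, dx \le C \alpha^{-\beta}$,
which completes the proof of Corollary 3.1. □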



Proof of Theorem 3.2. The proof follows the same lines as in [4], except
that we have to control the approximations in places where the Gaussian
approximation to Betas cannot be applied. Throughout this proof, $C$
denotes a generic positive constant. We first bound the difference between
both mixtures at all $x$. By symmetry, we can consider $x \in [0, 1/2]$. Consider
the following approximation of the exponential: for all $s > 0$ and all $z > 0$,

$$\left| e^{-z} - \sum_{j=0}^{s} \frac{(-z)^j}{j!} \right| \le \frac{z^{s+1}}{(s+1)!}. \tag{3.14}$$

Equation (3.5) implies that for all k > 1, k± > 3, there exist polynomial func- 
tions of x, Di(x), I < k\, and polynomial functions of e, bj(e), j < k, such 
that if \x — e\ < M5o\/log ax (I — x)/y/a, and setting 



0<z 



9a,e{x) 



(x — e) 2 a 
2x 2 {l-xf 



fci-2 



gi(g)(g--e) 
x'(l — x)' 



< CM 2 log (a), 



27TX(1 — x) 



l + E^# + 0(a" (fc+1) )l 



3=1 



where < aCa fcl / 2+1 / 2 (loga) fcl//2 . Consider eq = a *°, for some posi- 
tive constant to and e 3 - = Eq(1 + My/log a/ \/a) J , j = 1, . . . , J, with 



J 



t log(a) + 21og(log(a)) 



+ l = O(vW 1 °g Q 



log(l + MVloga/v 7 ^) 

Define dFj and dPj the renormalized probabilities dF and dP restricted 
to [Ej,Ej+i) set H > 0. Consider k\ — 1 > 2// and k > H — 1/2 and x £ 
[£j_i,£j_i_2], j > 2, using (3.14) together with the above approximation of 
9aei we consider the moment matching approach of [4] (Lemma A.l) so 
that we can construct a discrete probability dPj with at most N = 2kki s + 1 
supporting points such that for all I < 2sk\,V < k, 



e l b l ,(e)d(F j -P j )(e)=0, 



leading to 



(3.15) 



< 



g a , £ (x)[dF 3 -dPj]{e) 
C 



x(l — x) 
0{a~ H ) 



f^ C s+l M 2{s+l) X a s+l _ 
— h CK 

(s + 1)! 



x(l — x) ' 

if s = so log a with so > C 2 M 4 + 1 and sologso > H. Moreover, for all x < 
Ej—i, using (3.4) and the fact that 

. , r ,, , M 2 (loga)e(l -e) 
aK(£,x)>aK(£,£j) > <! -, 
when $\varepsilon_{j+1} > \varepsilon > \varepsilon_j$, we obtain

9a,e{x)<— C-cM>lo Sa 
X{1 — x) 

for some positive constant c> 0. A similar argument implies that if x > £j+2 



9a,e( X ) ^ 



C 



-cM logo 



x(l — x) 

for some positive constant c > 0. Hence, by constructing P in the form, if 

£j+2 = 1 — ^0 
J 

dP{e) = ^2(F(s j+1 ) - Fie,)) dP^e) + F(e )S {eo) + (1 - F{e J+2 ))5 {£j+2) , 
we finally obtain for all x, 



(3.16) 



g a , e (x)[dF - dP\(e) 



< 



Ca 



-H 



x(l — x) ' 

where $P$ has at most $N_\alpha = N_0(\log\alpha)^{3/2}\sqrt{\alpha}$ supporting points, for some $N_0 > 0$ related to $H$. We now consider $x < \epsilon_0(1 - M\sqrt{\log\alpha/\alpha})$. We use the approximation (3.4).



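$$g_{\alpha,\epsilon_0}(x) = \frac{\sqrt{\alpha}}{\sqrt{2\pi}\,x(1-x)}\,e^{-\alpha K(\epsilon_0,x)/(\epsilon_0(1-\epsilon_0))}\,(1 + O(\alpha^{-1})).$$

Since, when $x < \epsilon_0$,

$$\frac{K(\epsilon_0,x)}{\epsilon_0(1-\epsilon_0)} < (1-\epsilon_0)^{-1}\log(\epsilon_0/x),$$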



we obtain 



9a,p{x) > e 



-a\og(eo/x) 



C^F{e ) 



x(l — x) 

and using the above inequalities on $g_{\alpha,\epsilon}(x)$ for $x < \epsilon_{j-1}$, we have

$$g_{\alpha,P}(x) < C\alpha^{-H}/x(1-x),$$

where $H$ depends on $M$, so that

$$|\log(g_{\alpha,P}(x))| < C\alpha|\log(x)|.$$

Since $g_{\alpha,f}$ is bounded (as a consequence of the fact that $g_{\alpha,f} - f$ is uniformly bounded whenever $f$ is continuous), and since $u|\log(u)|^p$ goes to zero when $u$ goes to zero,



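(3.17) $$\int_0^{\epsilon_0(1 - M\sqrt{\log\alpha/\alpha})} g_{\alpha,f}(x)\left|\log\frac{g_{\alpha,f}(x)}{g_{\alpha,P}(x)}\right|^p dx < C\alpha^{-t_0} + C\alpha^{-t_0}(\log\alpha)^p = O(\alpha^{-t_0}(\log\alpha)^p).$$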
Note also that if $\alpha$ is large enough,

$$g_{\alpha,f}(x) > f(x)/4,$$

so that $g_{\alpha,f}(x) > cx^{k_0}(1-x)^{k_1}$ for $x$ close to 0 and for all $x \in (\epsilon_0, 1-\epsilon_0)$, for all $H > 0$,

$$\frac{|g_{\alpha,f}(x) - g_{\alpha,P}(x)|}{g_{\alpha,f}(x)} < C\,\frac{\alpha^{-H}}{x^{k_0+1}(1-x)^{k_1+1}} < C\alpha^{-H + t_0(1 + k_0\vee k_1)}.$$

So that if $H > t_0(1 + k_0\vee k_1) + B/p$, with $B > 0$,

$$\int g_{\alpha,f}(x)\left|\log\frac{g_{\alpha,f}(x)}{g_{\alpha,P}(x)}\right|^p dx < C\alpha^{-t_0}(\log\alpha)^p + C\alpha^{-B} = O(\alpha^{-B})$$

as soon as $t_0 > B$. Moreover, we can assume that there exists a fixed $A$ such that for all $j$, $p_j > \alpha^{-A} = v$. Indeed, let $I_v = \{j; p_j < v\}$, then consider for $j \notin I_v$, $\tilde p_j = cp_j$ and for $j \in I_v$, $\tilde p_j = cv$, where $c$ is defined by $\sum_j \tilde p_j = 1$. This implies in particular that

$$|c - 1| < vJ < \alpha^{-A+1/2}(\log\alpha)^{3/2}.$$

Let $\tilde P = \sum_{j=0}^{J} \tilde p_j\delta_{\epsilon_j}(\epsilon)$, then $g_{\alpha,\tilde P} > cg_{\alpha,P}$ and if $A - 1/2 > B$,
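
The reweighting step can be sketched in a few lines of Python. The function name `floor_weights` and the example weights are hypothetical, chosen only to illustrate the construction: weights below the threshold $v$ are raised to $v$ and everything is rescaled by the constant $c$, so that each new weight is at least $c$ times the old one and $|c - 1|$ is at most $v$ times the number of atoms.

```python
def floor_weights(p, v):
    """Raise every weight below v to v, then rescale so the weights sum to 1.
    Returns (p_tilde, c) with p_tilde[j] = c * max(p[j], v)."""
    raised = [max(pj, v) for pj in p]
    c = 1.0 / sum(raised)
    return [c * r for r in raised], c

# Illustrative weights: the last atom carries a negligible mass.
p = [0.5, 0.3, 0.2 - 1e-6, 1e-6]
v = 1e-3
p_tilde, c = floor_weights(p, v)
print(c, p_tilde)
```

Because every new weight dominates $c$ times the old one, the reweighted mixture dominates $c$ times the original mixture pointwise, which is the inequality used in the argument.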



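$$KL(g_{\alpha,f}, g_{\alpha,\tilde P}) < C\alpha^{-B} + |\log c| < C\alpha^{-B}.$$

Also,

$$\int |g_{\alpha,\tilde P} - g_{\alpha,P}| < \alpha^{-A+1/2}(\log\alpha)^{3/2},$$

hence, if $A$ is large enough, inequality (3.16) is satisfied with $\tilde P$ instead of $P$.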
Since $p_0 = F_1(\epsilon_0) > F(\epsilon_0)/4$ and $F(\epsilon_0) > \alpha^{-t_0k_0}C$, by choosing $A > t_0k_0$, we obtain that $0 \notin I_v$ and

9 a> p(x) > 5a,e (^)-^( e o) Vx < e , 

so that (3.17) is satisfied with $\tilde P$ instead of $P$, which leads to: for all $B > 0$ there exists a distribution $P$, having less than $N_0\sqrt{\alpha}(\log\alpha)^{3/2}$ points in its support, satisfying: $p_j > \alpha^{-A}$ for some $A > 0$ and all $j$ and such that

$$\int g_{\alpha,f}(x)\left|\log\frac{g_{\alpha,f}(x)}{g_{\alpha,P}(x)}\right|^p dx = O(\alpha^{-B}),$$

which completes the proof of Theorem 3.2. □

Note, however, that $A$ depends on $B$ and so does $N_0$. Note also that this result could be used to obtain a rate of concentration of the posterior distribution around the true density when the latter is a continuous mixture.

In the following sections, we give the proofs of Theorems 2.1 and 2.2. 
4. Proofs of Theorems 2.1 and 2.2. To prove these theorems we use Theorem 4 of [6]. In particular, let $p > 2$, and following their notation define

$$B^*(f_0, \tau, p) = \{f;\ KL(f_0, f) < \tau^2;\ V_p(f_0, f) < \tau^p\}.$$

We also denote $J_n(\tau) = N(\tau, \mathcal{F}_n, \|\cdot\|_1)$, the $L_1$ metric entropy on the set $\mathcal{F}_n$, that is, the logarithm of the minimal number of balls with radii $\tau$ needed to cover $\mathcal{F}_n$, where $\mathcal{F}_n$ is a set of densities that will be defined in each of the proofs. The proofs consist in obtaining a lower bound on $\pi(B^*(f_0, \tau_n, p))$ and an upper bound on $J_n(\tau_n)$ when $f_0$ belongs to $\mathcal{H}(\beta, L)$.

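4.1. Proof of Theorem 2.1. Assume that $f_0 \in \mathcal{H}(\beta, L)$ and let $\tau_n = \alpha_n^{-\beta/2}$, with $\alpha_n$ an increasing sequence to infinity. We first bound from below $\pi(B^*(f_0, \tau_n, p))$. Let $\alpha \in (c_1\alpha_n, c_2\alpha_n)$, $0 < c_1 < c_2$; using Corollary 3.1 there exists a probability distribution with $N_n = N_0\sqrt{\alpha}(\log\alpha)^{3/2}$ supporting points such that

$$KL(f_0, g_{\alpha,P}) < C\alpha^{-\beta}, \qquad V_p(f_0, g_{\alpha,P}) < C\alpha^{-\beta},$$

with $P$ of the form

$$P(\epsilon) = \sum_{j=1}^{N_n} p_j\delta_{\epsilon_j}(\epsilon),$$

$\epsilon_j \in (\alpha^{-\beta}(\log\alpha)^{-\beta-1}, 1 - \alpha^{-\beta}(\log\alpha)^{-\beta-1})$ and $p_j > \alpha^{-A}$ for all $j = 1, \ldots, N_n$ and some fixed positive constant $A$. Set $\epsilon_0 = \alpha^{-\beta}(\log\alpha)^{-\beta-1}$, then $\epsilon_1 > \epsilon_0$. Consider $dP'(\epsilon) = \sum_j p'_j\delta_{\epsilon'_j}(\epsilon)$ with $|\epsilon'_j - \epsilon_j| < \delta\alpha^{-\gamma_1}\epsilon_j(1-\epsilon_j)$ and $|p_j - p'_j| < \delta\alpha^{-\gamma_1+1/2}p_j$, for some positive constant $\gamma_1 > 1/2$. Note that this implies that $|p'_j - p_j| < 2\delta\alpha^{-\gamma_1+1/2}p'_j$. Then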

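(4.1) $$KL(f_0, g_{\alpha,P'}) < C\alpha^{-\beta} + \int f_0(x)\log\frac{g_{\alpha,P}(x)}{g_{\alpha,P'}(x)}\,dx.$$

For the purpose of symmetry, we work on $x < 1/2$. Let $M_n = M\sqrt{\log\alpha}/\sqrt{\alpha}$; when $|x - \epsilon_j| < M_n\epsilon_j(1-\epsilon_j)$, Lemma B.1 implies that

$$\left|\frac{g_{\alpha,\epsilon_j}(x)}{g_{\alpha,\epsilon'_j}(x)} - 1\right| = O(\alpha^{-(\gamma_1 - 1/2)}\sqrt{\log\alpha}),$$

by choosing $k_2 > 2\gamma_1 - 1$ and $k_3 > \gamma_1 - 1/2$. Set $\beta_0 > 0$, then for all $x > e^{-\beta_0\alpha_n}$ and all $j$ such that $|x - \epsilon'_j| > M_n\epsilon_j(1-\epsilon_j)$, since $\epsilon_j(1-\epsilon_j) > \alpha^{-t_0}/2$ with $t_0 > \beta$, Lemma B.1 implies that if $\gamma_1 > t_0 + \beta + 2$,

$$\left|\frac{g_{\alpha,\epsilon_j}(x)}{g_{\alpha,\epsilon'_j}(x)} - 1\right| < C\beta_0\alpha^{-\gamma_1+2}\epsilon_j^{-1} = O(\alpha^{-\beta}).$$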
This implies that if $x \in (e^{-\beta_0\alpha}, 1 - e^{-\beta_0\alpha})$,

(4.2) $$\frac{g_{\alpha,P}(x)}{g_{\alpha,P'}(x)} = 1 + \frac{\sum_j (p_j - p'_j)\,g_{\alpha,\epsilon'_j}(x)}{g_{\alpha,P'}(x)} + \frac{\sum_j p_j\,(g_{\alpha,\epsilon_j}(x) - g_{\alpha,\epsilon'_j}(x))}{g_{\alpha,P'}(x)} = 1 + O(\alpha^{-\gamma_1+2}).$$



Now let $x < e^{-\beta_0\alpha}$, then $|x - \epsilon_j| > \epsilon_j(1-\epsilon_j)/2$ for all $j = 0, \ldots, N_n$, and there exists $c > 0$ independent of $\beta_0$ such that



a 



x(l — x) ' 



9a,P'{x) < e" 



x(l — x) 



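Note also that

$$g_{\alpha,\epsilon}(x) > C\,\frac{\sqrt{\alpha}}{x(1-x)}\,e^{-\alpha K(\epsilon,x)/(\epsilon(1-\epsilon))},$$

where

$$\frac{\alpha K(\epsilon,x)}{\epsilon(1-\epsilon)} = \alpha\left(\frac{1}{1-\epsilon}\log(\epsilon/x) + \frac{1}{\epsilon}\log(1-\epsilon) + \frac{x}{\epsilon} + o(x/\epsilon)\right) < \alpha\left(\frac{1}{1-\epsilon}\log(\epsilon/x) + \frac{1}{\epsilon}\log(1-\epsilon) + \frac{x}{\epsilon}\right) + o(1).$$

Consider the function

$$h(\epsilon) = \frac{1}{1-\epsilon}\log(\epsilon/x) + \frac{1}{\epsilon}\log(1-\epsilon) + \frac{x}{\epsilon};$$

since $x < |\log(1-\epsilon)|$ for all $\epsilon \in (\epsilon_0, 1-\epsilon_0)$, $h$ is increasing, and for all $\epsilon < 1/2$, $h(\epsilon) < 2|\log(x)| + O(1)$. This leads to

(4.3) $$g_{\alpha,P}(x) > C\,P([0,1/2])\,\frac{\sqrt{\alpha}}{x(1-x)}\,e^{2\alpha\log(x)}.$$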



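The same inequality holds for $g_{\alpha,P'}$, which implies that

$$\int_0^{e^{-\beta_0\alpha}} f_0(x)\left|\log\frac{g_{\alpha,P}(x)}{g_{\alpha,P'}(x)}\right|^p dx < C\alpha^{p+1}e^{-\beta_0\alpha} \quad \forall p > 1.$$

The same kind of inequalities are obtained for $x > 1 - e^{-\beta_0\alpha}$. Finally we obtain

(4.4) $$\int f_0(x)\left|\log\frac{g_{\alpha,P}(x)}{g_{\alpha,P'}(x)}\right|^p dx = O(\alpha^{-\beta}).$$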



Note that if \p'j — pj\ < a 13 A , then \p'- — pj\ <a @pj, so we need only 
determine a lower bound on the prior probability of the following set under 
the adaptive prior: set $\beta_0 < 1/2$

S n = We S Nn ; Ip'j - Pj \ < a-/- A ,j<N n } 

x {\ £j - 4| < a-W- X Ej(\ - £j),j < N n }. 

The prior probability of S ns i = {j/e S Nn ; \p^ - pj\ < a.n l3 ~ A ,j < N n } is 
bounded from below by a term in the form, 



The prior probability of S n ^2 = — £jl < a n 2 ^ — < N n } is 

bounded from below by a term in the form, 

N n 

jj[ £ .(l _ ^jT^JV^+l) > a -iV»[2(^+l)+T]_ 
3=1 

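Since $N_n < CN_0\sqrt{\alpha_n}(\log\alpha_n)^{3/2}$, for all $\alpha \in [c_1\alpha_n, c_2\alpha_n]$, together with the condition on $\pi_\alpha$, we obtain that there exists $C_1 > 0$ independent of $N_n$ such that

(4.5) $$\pi(B^*(f_0, \tau_n, p)) > e^{-N_nC_1\log n} > e^{-C_1N_0\sqrt{\alpha_n}(\log\alpha_n)^{5/2}}.$$

Set $\alpha_n = \alpha_0n^{2/(2\beta+1)}(\log n)^{-5/(2\beta+1)}$, then $\tau_n > \tau_0n^{-\beta/(2\beta+1)}(\log n)^{5\beta/(4\beta+2)} = \epsilon_n$.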
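
The exponents in this choice of $\alpha_n$ can be checked with a short Python sketch. It assumes the reading $\tau_n = \alpha_n^{-\beta/2}$ and $\epsilon_n = n^{-\beta/(2\beta+1)}(\log n)^{5\beta/(4\beta+2)}$, sets the constants $\alpha_0$ and $\tau_0$ to 1, and replaces $\log\alpha_n$ by $\log n$ (same order); these simplifications and the function names are assumptions made for illustration only.

```python
import math

def alpha_n(n, beta):
    # alpha_n = n^{2/(2 beta + 1)} (log n)^{-5/(2 beta + 1)}, constant alpha_0 set to 1.
    return n ** (2.0 / (2.0 * beta + 1.0)) * math.log(n) ** (-5.0 / (2.0 * beta + 1.0))

def eps_n(n, beta):
    # eps_n = n^{-beta/(2 beta + 1)} (log n)^{5 beta/(4 beta + 2)}, constant tau_0 set to 1.
    return n ** (-beta / (2.0 * beta + 1.0)) * math.log(n) ** (5.0 * beta / (4.0 * beta + 2.0))

n, beta = 10**6, 2.0
a = alpha_n(n, beta)
lhs = n * eps_n(n, beta) ** 2             # n * eps_n^2
rhs = math.sqrt(a) * math.log(n) ** 2.5   # sqrt(alpha_n) * (log n)^{5/2}
print(a ** (-beta / 2.0), eps_n(n, beta))  # the two values agree
print(lhs, rhs)                            # and so do these
```

Under these assumptions the exponent in the prior mass bound (4.5) is of the same order as $n\epsilon_n^2$, which is the balance that a posterior rate argument of this type requires.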

We now determine an upper bound on the entropy on some sieve of 
the support of the adaptive prior. Denote ao n = e - nl/(2,3+1> ( lo g n ) 5,3 ' /(2,3+1> ari d 

ain = a Q X n 2/(2/3+l)( logn )5/3/(2/3+l) ; and ^ 

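$$\mathcal{F}_{n,a} = \{(P,\alpha);\ k < k'_n,\ \alpha_{0n} < \alpha < \alpha_{1n};\ \epsilon_j > \epsilon_0,\ \forall j\}$$

with $a, c > 0$, $k'_n = k'_1n^{1/(2\beta+1)}(\log n)^{q_\beta}$ with $q_\beta = 5\beta/(2\beta+1)$ if $L(k) = 1$ and $q_\beta = (3\beta - 1)/(2\beta + 1)$ if $L(k) = \log k$ in the definition of the prior on $k$, and $\epsilon_0$ is defined by

$$\epsilon_0 = \exp\{-an^{1/(2\beta+1)}(\log n)^{5\beta/(2\beta+1)}\}.$$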

Since ir a is bounded, for some c> 0, 



To bound the entropy on $\mathcal{F}_{n,a}$, we use Lemma C.1 with the following parametrization: write $a = \alpha/(1-\epsilon)$, $a' = \alpha'/(1-\epsilon')$, $b = \alpha/\epsilon$ and $b' = \alpha'/\epsilon'$ and consider $\rho > 0$ small enough, then if $|a' - a| < \tau_1 < a$ and $|b' - b| < \tau_2 < b$,

x a_ Tl „ 1 , 1 _s b -T2-i B ( a _ T b _ T2 \ 



B(a-T 1 ,b-r 2 ) B(a',V) 
so that

B(a- n,b- t 2 ) 



< 1 + T n \g a 'e' ~ 9a,e\ < T~n- 



B(a',b>) 

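Consider first $\alpha < 2\epsilon \wedge (1-\epsilon)$. If

(4.6) $|\epsilon - \epsilon'| < \rho\tau_n\epsilon(1-\epsilon)$, $|\alpha - \alpha'| < \rho\tau_n\alpha$,
then using case (i) of Lemma C.1 and simple algebra, we obtain

$$|g_{\alpha',\epsilon'} - g_{\alpha,\epsilon}| < 4\rho\tau_n.$$

We now consider the $\alpha, \epsilon$'s such that $2(1-\epsilon) < \alpha < 2\epsilon$. If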

fAT\ i >\ / T n e(l-e) , apr n 

(4.7) \e — £\<p- — i—j-, —-, \a — a\< 



log(a/(l-e))' 1 '-log(a/(l- e ))' 
then using case (ii) of Lemma C.l and simple algebra, we obtain 

\9a',e> - 9a,e\ < 2p' r n 

for some p' > 0. Last we consider the case where a > 2e V (1 — e). If 
pT n e 2 (l-e) 2 . pe{l-e)r n 



(4.8) \e-e'\< ' n \ \ \a-a'\< 

log(a/e(l — e)) alog(a/e(l — e)) 

then case (iv) of Lemma C.l implies 

\g a ',e' - 9a,e\ < 2p T n 

for some p' > 0. Therefore, the number of intervals in a needed to cover 

(e _ n l/(2 5+ i) (logn)S5 /( 2fl+1 ) < a < aon 2/(2/3+l) (logn) 10/3/(2/3+1)) ig bounded by 

Ji < Cn D e^ < Cn e (-+ 1 )« 1/(2W1, ^«) 5WW+1, ) 

where C, -D are positive constants. We now consider the entropy associated 
with the supporting points of P. The most restrictive relation is (4.8). 

1/7 

Let e n j = e , j = 1, . . . , J with 

j _ an l/(2/3+l)( logn) 5/3/(2/3 + l) _ 



tlog?l fc^t 

so that e ni j = n - *. Let P = Yli=iPi9a,si, and N n j(P) be the number of 
points in the support of P belonging to (e n j,e n j + i). 

The number of intervals following relation (4.8) needed to cover (e n j , e nj+ i) 
is bounded by 



T _ [log(£n,j+i) -log(£nj)]ra 
n J — 

£ n,j 
for some positive constant $D_1$ independent of $t$. The number of intervals following relation (4.8) needed to cover $(n^{-t}, 1/2)$ is bounded by $J_{n,J+1} = n^{t+1}(\log n)^q$ for some positive constant $q$. For simplicity's sake we consider $D_1 = D_2$. We index the interval $(n^{-t}, 1/2)$ by $J + 1$. Consider a configuration $\sigma$ in the form $N_{n,j}(P) = k_j$, for $j = 1, \ldots, J+1$ where $\sum_j k_j = k < k'_n$,

and define $\mathcal{F}_{n,a}(\sigma) = \{P \in \mathcal{F}_{n,a};\ N_{n,j}(P) = k_j,\ j = 1, \ldots, J+1\}$. For each configuration, the number of balls needed to cover $\mathcal{F}_{n,a}(\sigma)$ is bounded by

= n/=i Jn,r Moreover, the prior probability of J- n ,a(&) is bounded by 

< r(k + 1) f(fri) ' p ^ ^ c[ <Si - ^ J ' 

for some positive constant $c > 0$ and $p_{n,J+1} < 1$. We obtain, since $T > 1$
and i > 2 



A n = ^ J n{F n) g{o))yJ J n (a) 



„(W)kj +1 /2 ^ £ (T+l)*i/2 



-T+l -,kj/2 
'n,j 



r(k J+1 + i = i; e w T{kj + 1)V2 

x [log(enj+i)-log(e„j)]^ /2 1 
Since 

J 

Jr(^ + l) 1 / 2 < exp(fclog(fc + 1)) < e klo ^ n \ 
i=i 

if tT > 6, we have 

A n < C fc n fcDi r(A: + 1) 1/2 ^ ex Pi~ afc i fc n lo g n [ r i ~ 2]/(2fc^j(j + j0)> 
< C k n k ^ +t / 2+l ^T{k + l)- 1 /2 exp |_ ™°g n | ( 



3- 

3 fclog(J) 



< k(D 1 -Tt/6+t/2+l/2)\ogn 
Hence by choosing $\tau_n = \tau_0n^{-\beta/(2\beta+1)}(\log n)^{5\beta/(4\beta+2)}$ with $\tau_0$ large enough, the above term multiplied by $e^{-n\tau_n^2}$ goes to 0 with $n$, which completes the proof of Theorem 2.1.

4.2. Proof of Theorem 2.2. The proof for the control of the prior mass of Kullback-Leibler neighborhoods of the true density under the Dirichlet prior follows the same line as the proof under the adaptive prior. To find a lower bound on $\pi(B^*(f_0, \tau_n, p))$, we construct a subset of $B^*(f_0, \tau_n, p)$ whose probability under a Dirichlet process is easy to compute. Consider $\alpha \in (c_1\alpha_n, c_2\alpha_n)$ and the discrete distribution $P(\epsilon) = \sum_{j=0}^{N_n} p_j\delta_{\epsilon_j}(\epsilon)$ with $N_n = N_0\sqrt{\alpha}(\log\alpha)^{3/2}$ and $\alpha^{-t_0} = \epsilon_0 < \epsilon_1 < \cdots < \epsilon_{N_n} = 1 - \epsilon_0$ and such that

$$KL(f_0, g_{\alpha,P}) < C\alpha^{-\beta}, \qquad V_p(f_0, g_{\alpha,P}) < C\alpha^{-\beta}.$$

The above computations [leading to (4.4)] imply that there exists $D_1$ such that if $|\epsilon - \epsilon'| < \alpha^{-D_1}$, we can replace $g_{\alpha,\epsilon}$ by $g_{\alpha,\epsilon'}$ in the expression of $g_{\alpha,P}$ without changing the order of approximation of $f_0$ by $g_{\alpha,P}$. Hence we can assume that the point masses $\epsilon_j$ of the support of $P$ satisfy $|\epsilon_j - \epsilon_{j+1}| > \alpha^{-D_1}$, $j = 0, \ldots, N_n$. We can thus construct a partition of $[\epsilon_0/2, 1 - \epsilon_0/2]$, namely $U_0, \ldots, U_{N_n}$, with $\epsilon_j \in U_j$ and $\mathrm{Leb}(U_j) > 2^{-1}\alpha^{-D_1}$ for all $j = 1, \ldots, N_n$, where Leb denotes the Lebesgue measure. Let $\rho > 0$ and $P_1$ be any probability on $[0, 1]$ satisfying

(4.9) $|P_1(U_j) - p_j| < p_j\alpha^{-\rho}$ $\forall j = 0, \ldots, N_n$.

Then $P_1[\epsilon_0/2, (1 - \epsilon_0/2)] > 1 - \alpha^{-\rho}$. Since

$$g_{\alpha,P_1}(x) > g_{n,P_1}(x) = \int_{\epsilon_0/2}^{1-\epsilon_0/2} g_{\alpha,\epsilon}(x)\,dP_1(\epsilon)$$

and using (4.1), we obtain

$$KL(f_0, g_{\alpha,P_1}) < C\alpha^{-\beta} + \int f_0(x)\log(g_{\alpha,P}(x)/g_{n,P_1}(x))\,dx.$$

Set $\rho > \beta$, then, similarly to before, we obtain inequality (4.2) with $g_{n,P_1}$ instead of $g_{\alpha,P'}$. When $x < e^{-\beta_0\alpha}$, we use the calculations leading to (4.4) replacing $g_{\alpha,P'}$ with $g_{n,P_1}$, which finally leads to (4.4) between $g_{\alpha,P}$ and $g_{n,P_1}$.
To bound fo /o (x) | log ( J ^) ) | p , note first that 

Wifa) -ffn.ftfa) < Pi[0,e ]CV^fa(l -x))" 1 < Ca^ +1 / 2 ( x (l - 

For the purpose of symmetry, we work on $[0, 1/2]$ and we split $[0, 1/2]$ into $[0, e^{-\beta_0\alpha}]$, $[e^{-\beta_0\alpha}, \epsilon_0]$, $[\epsilon_0, 1/2]$. Since

g a ,p{x) > g a j t - \g a j x - g a ,p\ > — A 7; r ViJ > 0, 

4 x(l — x) 
when $x \in (\epsilon_0, 1/2)$, we have $g_{\alpha,P}(x) > cf_0(x)$, since $f_0(x) > C_0x^{k_0}$ near the origin, for some positive constant $c$. Hence combining the above inequality with (4.2) based on $g_{\alpha,P}$ and $g_{n,P_1}$, we obtain that



g n , Pl (x)x(l - x) fo(x)x(l - x) 

= 0(a-V p ), if x € (e , l/2)p > /9/p + 1/2 + (k + l)t . 
Moreover, (4.2) implies, also, that for all x G (e - ' 300 ' 1 , a~ to ), 



9n,I\ > 9a,p(x)/2 > (x/£ ) C 



leading to 



log 1 + 



9a,Pi ( x ) ~ ~9n,P x (x) 
~9n,P x (x) 



<log 1 + 



x(l — x) ' 



<Ca\log(x)\ Vie(e- fta ,a- to ; 



Also, if x < e P° a , using similar calculations to those used to derive (4.3), 
we obtain 



~9u,p x > CP 1 ([ eo ,l/2])-f^-e 2alo gM 
x(l — X) 



and 



Finally, we obtain 



. 9a,Px {x) ~ 9n,Pi (x) . 
log I H ' z I < Ca|log(x) 



9n,I\(x) 



Vx < e 



-0 o a 



fo(x] 



log 



fojx) 

9a,P! (x) 



dx < 0(a 



+ 



fo(x) 



log 



9n,Pi (x) 



dx 



^ga,pA x ), 

< a-* 0+p (loga) p + 0(pT p ) = 0{pT p ), 

whenever $t_0 > \beta + p$, which implies $\rho > \beta/p + 1/2 + (\beta + p)(k_0 + 1)$.

Under the Dirichlet prior, $(P_1(U_0), P_1(U_1), \ldots, P_1(U_{N_n}))$ follows a Dirichlet $(\nu(U_0), \nu(U_1), \ldots, \nu(U_{N_n}))$ with $U_0$ being the complementary set of $(U_1 \cup \cdots \cup U_{N_n})$. Using the fact that $\nu(U_j) > C\alpha^{-T_1D_1}$ for all $j$, we obtain that there exist $D_2, C_2 > 0$ such that

$$\pi(S_n) > \exp\{-D_2N_n\log(\alpha_n)\} > e^{-C_2N_0\sqrt{\alpha_n}(\log\alpha_n)^{5/2}}.$$

The above inequality can be derived, for instance, from Lemma A.2 of [4]. Setting $\alpha_n = n^{2/(2\beta+1)}(\log n)^{-5/(2\beta+1)}$ implies that $\tau_n > \tau_0n^{-\beta/(2\beta+1)}(\log n)^{5\beta/(4\beta+2)} = \epsilon_n$.
We now bound the $L_1$ entropy for the Dirichlet prior. To do so, we use the approximation of any mixture of Beta densities by a finite mixture, and we bound the entropy of a finite mixture. We cannot use the control of the entropy for the adaptive prior, however, since it is based on the prior mass of partitions of sieves $\mathcal{F}_{n,a}(\sigma)$, which is not easily controlled under Dirichlet priors. Let $\epsilon_0 = \exp\{-a\sqrt{\alpha_n}(\log\alpha_n)^{5/2}\}$, $\alpha_n$ as above, and define

T n = {F; F[e Q , 1 - e ] > 1 - a' 13 ,n l <a< a n (loga n ) 5 }. 
Under a Dirichlet z^-process, 



v[Q,e Q ] i/[l-e ,l] 



+ exp{-6^/a^(loga n ) 5/2 } 



u[0,l] u([0,l}) 

< Ca^exp{-a^/a^"(log a„) 5/2 }. 

For all F E J- n , define F n , the renormalized restriction of F, on [eo, 1 — £o]- 
Then 



\9a„,F n -g a „,F\h < 2a \ 



-0 



We can, therefore, assume that F[eq, 1 — £o] = 1 f° r au F £ J-' n . Then there 
exists a discrete probability 



N„ 



(4.10) 



P(e) = ^Pj^is), £j G (e ,l - £o) Vj, 



with N n < N ^/a(loga) 3/2 such that (3.16) is satisfied for F for all H (de- 
pending on No), and 



eo/3 



\g a ,F - g a ,p\(x)dx < 



1— eo 



\dF(e) + dP{e)\ 



eo/2 



g a , e {x)dx . 



When x < Eq/3 < e/3, using (A. 5) we obtain 



9a,e{x) < 



Cy/a ( 2x 



x(l — x) \x + e 
which implies that 

reo/3 



ae/(2e(l-e)) 



<CV^e- a/{2{1 ^ £)) (2xr^ 1 ^- 1 , 



(4.11) 



9a , £ (x)dx<Ca-^(l-e)e-^-^[^ 

<Ca- 1 / 2 (l-e)(3/2)-^ 2 ( 1 - £ )) 
= 0(a~ H ) \/H>0. 



a/(2(l- £ )) 
By symmetry, the same bound is obtained for the integral over $(1 - \epsilon_0/2, 1)$. Finally, for all $H > 0$, there exists $N_0 > 0$ and a probability measure $P$ defined by (4.10) with $N_n = N_0\sqrt{\alpha}(\log\alpha)^{3/2}$ such that

$$\|g_{\alpha,F} - g_{\alpha,P}\|_1 < \alpha^{-H}.$$

Hence the entropy of $\mathcal{F}_n$ is bounded by the entropy of the set

K = yP = YlPj9a,E j ;k<N n ;£ j G (s ,l- £o) Vj';n* < a < a n (log a„) 5 |. 

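Let $k < N_n$ be fixed and $g_{\alpha,P}$ be a Beta mixture with $k$ components. When $|\epsilon_j - \epsilon'_j| < \delta\alpha^{-\gamma_1-2}\epsilon_j(1-\epsilon_j)$ for all $j < k$ and $|p_j - p'_j| < \alpha^{-\gamma_1-1}$, if $|x - \epsilon_j| < \epsilon_j(1-\epsilon_j)M_\alpha$, then Lemma B.1 implies

$$|g_{\alpha,\epsilon'_j} - g_{\alpha,\epsilon_j}| < g_{\alpha,\epsilon_j}\,C\alpha^{-\gamma_1}\sqrt{\log\alpha_n},$$

and if $|x - \epsilon_j| > \epsilon_j(1-\epsilon_j)M_\alpha$, then $|x - \epsilon'_j| > \epsilon_j(1-\epsilon_j)M_\alpha/2$ and the convexity of $x \mapsto K(\epsilon,x)$ for all $\epsilon$, together with (3.4), implies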

'J J x(l — X) 

Combining the above inequality with (4.11) leads to 

(4.12) [ l \g a ^ -g a:£ ,\(x)dx = 0(a-^) 

Jo J 



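and

$$\int_0^1 |g_{\alpha,P} - g_{\alpha,P'}|(x)\,dx = O(\alpha^{-\beta}),$$

by choosing $\gamma_1$ large enough. Similarly, considering $|\alpha - \alpha'| < n^{-B}\alpha$, we obtain, using (3.5),

$$|g_{\alpha,\epsilon}(x) - g_{\alpha',\epsilon}(x)| < Cg_{\alpha,\epsilon}(x)\,n^{-B},$$

leading to

$$\int_0^1 |g_{\alpha,P} - g_{\alpha',P'}|(x)\,dx = O(\alpha^{-\beta}),$$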



by choosing B large enough. The number of balls needed to cover (n* , a n (log a n 
under the above constraint is bounded by Cn B log n. The number of balls 
with radii 5ia„ 71 needed to cover the set Sk is bounded by 



C k a, 



-hrfi 
The number of balls with radii $\epsilon_j(1-\epsilon_j)\alpha_n^{-\gamma_1-2}\delta_0$ needed to cover $(\epsilon_0, 1-\epsilon_0)$ is bounded by

{ac% 2 +^{\oga n fl 2 ) k . 
Finally, the metric entropy is bounded by 

Jn(jn) < 3k n /3loga n < 3ki/3^/a~(loga n ) 5 < Cut 2 , 
which completes the proof of Theorem 2.2.

APPENDIX A: PROOF OF LEMMA 3.2 
Throughout the proof, C denotes a generic constant. Let 



$$I_0(x) = g_\alpha(x) - 1 = \int_0^1 g_{\alpha,\epsilon}(x)\,d\epsilon - 1.$$
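
To get a feel for the size of $I_0$, the following Python sketch evaluates the integral numerically. It assumes the parametrization used in Section 4, namely that $g_{\alpha,\epsilon}$ is the Beta density with parameters $\alpha/(1-\epsilon)$ and $\alpha/\epsilon$; the function names `g` and `I0`, the midpoint rule and the values of `alpha` and `x` are illustrative choices, not part of the proof.

```python
import math

def g(x, eps, alpha):
    """Beta density at x with parameters alpha/(1-eps) and alpha/eps,
    i.e. mean eps and scale alpha (parametrization assumed from the text)."""
    a, b = alpha / (1.0 - eps), alpha / eps
    log_pdf = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
               + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))
    return math.exp(log_pdf)

def I0(x, alpha, m=4000):
    """Midpoint-rule approximation of int_0^1 g(x, eps, alpha) d eps - 1."""
    h = 1.0 / m
    return sum(g(x, (i + 0.5) * h, alpha) for i in range(m)) * h - 1.0

alpha, x = 400.0, 0.3
print(I0(x, alpha))  # small in absolute value for large alpha
# Peak height versus the Gaussian value sqrt(alpha) / (sqrt(2 pi) x (1 - x)).
print(g(x, x, alpha), math.sqrt(alpha) / (math.sqrt(2.0 * math.pi) * x * (1.0 - x)))
```

The second printed line compares the density at $\epsilon = x$ with the height of a Gaussian of variance $x^2(1-x)^2/\alpha$, which is the approximation that motivates the window width used in the proof.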



The aim is to approximate $I_0$ with an expansion of terms in the form $Q_j(x)\alpha^{-j/2}$ where $Q_j$ is a polynomial function. The idea is to split the integral into three parts, $I_1, I_2, I_3$, corresponding to $\epsilon < x - \delta_x$, $\epsilon > x + \delta_x$ and $|x - \epsilon| < \delta_x$, where $\delta_x = \delta_0x(1-x)\sqrt{\log(\alpha)/\alpha}$, for some well-chosen $\delta_0 > 0$. Note that this choice of $\delta_x$ comes from the approximation of the Beta density with a Gaussian with mean $x$ and variance $x^2(1-x)^2/\alpha$. We first prove that the first two parts are very small and the expansion is obtained from the third term. By convexity of $K(\epsilon, x)$ as a function of $\epsilon$, $K(\epsilon, x) > K(x - \delta_x, x)$ for all $\epsilon < x - \delta_x$, and $K(\epsilon, x) > K(x + \delta_x, x)$ for all $\epsilon > x + \delta_x$. Moreover,



K(x -6 x ,x) = x[l- So{1 ~ X ^f^) logf 1 ^(l-x)v^H 



a l \ \ a 



5 x/log(a)\ ( 6 x/log(a) 
+ (1 — x) I 1 + x — —= log I l + x- 



= ^lQ g(a Ml-x) +0 / t _ 

2a \ \ a 

uniformly in $x$. Using a similar argument on $K(x + \delta_x, x)$, we finally obtain, when $\alpha$ is large enough,

(A.1) $$K(x - \delta_x, x) > \frac{\delta_x^2}{3x(1-x)}, \qquad K(x + \delta_x, x) > \frac{\delta_x^2}{3x(1-x)}.$$
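
The second-order behaviour behind (A.1) can be checked numerically. The sketch below takes $K(\epsilon,x) = \epsilon\log(\epsilon/x) + (1-\epsilon)\log((1-\epsilon)/(1-x))$ and $\delta_x = \delta_0x(1-x)\sqrt{\log(\alpha)/\alpha}$ as in the text; the reading of (A.1) as a lower bound $\delta_x^2/(3x(1-x))$, the value $\delta_0 = 1$ and the grid of $x$ values are assumptions made for illustration.

```python
import math

def K(eps, x):
    """Kullback-Leibler divergence between Bernoulli(eps) and Bernoulli(x)."""
    return eps * math.log(eps / x) + (1.0 - eps) * math.log((1.0 - eps) / (1.0 - x))

def delta_x(x, alpha, delta0=1.0):
    # Window half-width delta_0 * x * (1 - x) * sqrt(log(alpha) / alpha).
    return delta0 * x * (1.0 - x) * math.sqrt(math.log(alpha) / alpha)

alpha = 1e4
for x in (0.01, 0.1, 0.5, 0.9, 0.99):
    d = delta_x(x, alpha)
    quad = d * d / (x * (1.0 - x))  # delta_x^2 / (x (1 - x))
    # Both ratios are close to 1/2, hence above 1/3.
    print(x, K(x - d, x) / quad, K(x + d, x) / quad)
```

Since the ratios sit near $1/2$ uniformly over the grid, replacing $1/2$ by $1/3$ leaves room for the higher-order terms once $\alpha$ is large.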

Set 

Sir, 



rx-Ox 

h{x) = I g a , £ (x)de. 
Jo 
First we consider $x < 1/2$, then using (3.4) and the fact that if $\alpha$ is large enough, the term in the square brackets in (3.4) with $k = 1$ is bounded by 2, uniformly in $\epsilon$, we obtain that

J l(x) < f X ~ Sx e ~5^(l- X )lo ga /(3e(l-e)) d£ ^ 



/ 2ttx(1 — x) Jo 
Let p = (Sqx(1 — x) loga)/6 then 

(A.2) h{x) < h^ e -ol^-^) < CV^e- 5 > ga/e . 



Now we consider $x > 1/2$, for which we use another type of upper bound: we split the interval $(0, x - \delta_x)$ into $(0, x(1-\delta))$ and $(x(1-\delta), x - \delta_x)$ for some well-chosen positive constant $\delta$. For all $\epsilon < x(1-\delta)$, $K(\epsilon, x) > K(x(1-\delta), x)$. Since $u\log(u)$ goes to zero when $u$ goes to zero, there exists $\delta_1 > 0$ such that for all $x > 1/2$, and all $\delta_1 < \delta < 1$,

K{x{\ -S),x) = x(l - 5) log(l - 5) 

+ (1 -x + fa)log( 1 + 



1 -x 



> 5 2 xlog( 1 + 

\ 1 — x 

Therefore, using (3.4) and the same bound on the square brackets term in 
(3.4), as in the case x < 1/2, we obtain that if x > 1/2, 

x{X-S) 

g a>E (x)de 

~ V2^x(l-x)Jo V 2(1 -x) 



(A.3) 



-aS 2 /2 



l-x)\ 2(1 -x) 

< CoT H Vff > 0. 

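We now study the integral over $(x(1-\delta), x - \delta_x)$. We use the following lower bound on $K(\epsilon,x)$: a Taylor expansion of $K(\epsilon,x)$ as a function of $\epsilon$ around $x$ leads to

(A.4) $$\begin{aligned} K(\epsilon,x) &= \epsilon\log\left(\frac{\epsilon}{x}\right) + (1-\epsilon)\log\left(\frac{1-\epsilon}{1-x}\right) \\ &= (\epsilon - x)^2\int_0^1 \frac{1-u}{(x + u(\epsilon - x))(1 - x - u(\epsilon - x))}\,du \\ &> \frac{(\epsilon - x)^2}{2}\int_0^{1/2} \frac{1}{1 - x + u(x - \epsilon)}\,du \\ &= \frac{x - \epsilon}{2}\left(\log(1 - x/2 - \epsilon/2) - \log(1 - x)\right). \end{aligned}$$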

Let $u = x - \epsilon$, and note that the function $u \mapsto u/(x-u)(1-x+u)$ is increasing so that when $\alpha$ is large enough, uniformly in $x$,

V— / ■• . /c,\ —au/(2(x—u)(l—x+u)) 

a I 1 — x + u 2 N 



f 1 _ x + u/2 \S/(2(l-S)(l- X+ S X )) 



< V2~kx(1 -x)\ 1-x 
for all $u \in (\delta_x, \delta x)$. Thus if $\alpha$ is large enough and $x > 1/2$,

-S x 

g a>e (x)de 

x(l-8) 

2 ^ rSx / u \ -<x5/(2(l-8)(l-x+8x)) 



V2^x(l - x) J Sx V 2(1 - x) 
< 1 / 5, \ -«*/(2d-«)(i-*+fa)) 



^a5/{2{l-5){l-x + 5x))-l\ 2(1 - x) 

< _^ e -55oV^\/ lo g(°)/(2(l-5)) = ( a --ff) 



for any $H > 0$. Finally, the above inequality, together with (A.3) for $x > 1/2$ and with (A.2) for $x < 1/2$ implies that

$$I_1(x) = O(\alpha^{-H})$$

for all $H > 0$ by choosing $\delta_0$ large enough. We now consider the integral over $(x + \delta_x, 1)$



rx(l+S) rl 

h(x)= I g a , £ (x)de+ g aj£ (x)ds. 

Jx+8 X Jx(l+5) 



First let $x < 1/2$, then when $\epsilon \in (x + \delta_x, x(1+\delta))$ with $\delta$ small enough, we can use (3.6) and

x(l+S) 

g a , £ (x)de<2e- s >^ 2 . 

X+&x 
When $\epsilon \in (x(1+\delta), 1)$, a Taylor expansion of $K(\epsilon,x)$ as a function of $\epsilon$ around $x$ leads to

(A.5) $$\begin{aligned} K(\epsilon,x) &= (\epsilon - x)^2\int_0^1 \frac{1-u}{(x + u(\epsilon - x))(1 - x - u(\epsilon - x))}\,du \\ &> \frac{(\epsilon - x)^2}{2}\int_0^{1/2} \frac{1}{x + u(\epsilon - x)}\,du \\ &= \frac{\epsilon - x}{2}\left(\log((x + \epsilon)/2) - \log x\right). \end{aligned}$$
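
Both Taylor-type lower bounds on $K(\epsilon,x)$ are easy to sanity-check on a grid. The sketch below assumes the closed forms $\frac{x-\epsilon}{2}(\log(1 - x/2 - \epsilon/2) - \log(1-x))$ for $\epsilon < x$ and $\frac{\epsilon-x}{2}(\log((x+\epsilon)/2) - \log x)$ for $\epsilon > x$; the helper names and the grid are illustrative.

```python
import math

def K(eps, x):
    return eps * math.log(eps / x) + (1.0 - eps) * math.log((1.0 - eps) / (1.0 - x))

def lower_left(eps, x):
    # Bound of the form (A.4), intended for eps < x.
    return 0.5 * (x - eps) * (math.log(1.0 - 0.5 * (x + eps)) - math.log(1.0 - x))

def lower_right(eps, x):
    # Bound of the form (A.5), intended for eps > x.
    return 0.5 * (eps - x) * (math.log(0.5 * (x + eps)) - math.log(x))

grid = [i / 50.0 for i in range(1, 50)]
ok = all(
    K(e, x) >= (lower_left(e, x) if e < x else lower_right(e, x))
    for x in grid for e in grid if e != x
)
print(ok)
```

The bounds are loose near $\epsilon = x$ (roughly a factor $(1-x)/2$ or $x/2$ of the quadratic term), which is why they are only used away from the central window.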

Thus letting $u = \epsilon - x$ and noting that $\epsilon(1-\epsilon) < x + u$ and that $u/(x+u) > \delta/(1+\delta)$ as soon as $u > \delta x$, we obtain



L 



, C V^ i X - X ( 2x \ -5/(2(1+5)) 
g at e{x)de < / - — ■ — du 



x(l+8) ' x JSx \2x + u 

/ x _ a 5/(2(l+<5))+l 

<2Ca- 1 / 2 (l + d - 

If $x > 1/2$ and $\epsilon > x + \delta_x$, by symmetry, we obtain the same result as in the case $x < 1/2$ and $\epsilon < x - \delta_x$, changing $x$ into $1 - x$. Finally, choosing $\delta_0$ large enough, we prove that for all $x \in [0, 1]$,

(A.6) $$I_1(x) + I_2(x) = o(\alpha^{-H})$$

($H$ depending on $\delta_0$). We now study the last term, $I_3(x)$. Using (3.6), and the fact that when $\epsilon \in (x - \delta_x, x + \delta_x)$,

$$|R(x,\epsilon)| < R'\alpha^{k_2+1}|x - \epsilon|^{3(k_2+1)}(x(1-x))^{-3(k_2+1)} < C\alpha^{-(k_2+1)/2}(\log\alpha)^{3(k_2+1)/2},$$

we obtain, for all ^2 > > 3(&2 — 1), and considering the change of vari- 
able u = v / a(x — e)/(x(l — x)), 

I 3 (x) = / g a , £ de - 1 



= ^ tojCW + ^ toBM + 0(a -( fc2+ i)/2 (log a) 3( fc2+ i)/ 2) 
z — ' a J ' 2 ^-^ ail 2, 

3=1 3=1 

= T M + 0(a-( fe2+1 )/ 2 (loga) 3 ( fc2+1 )/ 2 ), 
a 

choosing 5q large enough, and since (i\ = where the -Bj's are polynomial 
functions of x coming from and C{x) and where the remaining term is 
uniform in x. Lemma 3.2 is proved. 
APPENDIX B: LEMMA B.1



Lemma B.1. Let $(\delta_n)_n$, $(\beta_n)_n$ and $(\rho_n)_n$ be positive sequences decreasing to 0 and assume that $\alpha_n$ increases to infinity. Let $1 - \delta_n > \epsilon, \epsilon' > \delta_n$ and $|\epsilon - \epsilon'| < \rho_n\epsilon(1-\epsilon)/\sqrt{\alpha_n}$, then for all $|x - \epsilon| < M\epsilon(1-\epsilon)\sqrt{\log\alpha_n}/\sqrt{\alpha_n}$, if $\rho_n\sqrt{\log\alpha_n}$ goes to 0 as $n$ goes to infinity, for all $k_2, k_3 > 1$,



9a n ,e(x) 



9a n ,e'{x) 



1 



fc 3 1 



for $n$ large enough. Also, for all $x \in (\beta_n, 1-\beta_n)$, if $\alpha_n^{1/2}\rho_n|\log(\beta_n)|\delta_n^{-1} = o(1)$, for $n$ large enough,

9anA X ) 



9a n ,e'{x) 



< CfayVnllog^n)!*- 1 + a- fc2 / 2 (log 



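Proof. First let $|x - \epsilon| < M\epsilon(1-\epsilon)\sqrt{\log\alpha_n}/\sqrt{\alpha_n}$; since $|\epsilon - \epsilon'| < \rho_n\epsilon(1-\epsilon)/\sqrt{\alpha_n}$, we have that

$$|x - \epsilon'| < \epsilon(1-\epsilon)\alpha_n^{-1/2}[M\sqrt{\log\alpha_n} + \rho_n] < 2M\epsilon(1-\epsilon)\alpha_n^{-1/2}\sqrt{\log\alpha_n}$$
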


and 



(x - e') 1 = {x- e) 1 + (e - e') Cj{e - e'f~ V - e)'" 



i=i 



= (x- e) 1 + 0(a- l ' 2 Pn e l (1 - £ )< (loga^" 1 )/ 2 ). 

We control g a „,e/9a n ,s' using approximation (3.5). Then noting that when 
n is large enough, 



1 + 



x(l — x) 



C(x) + Q kl 



X — £ 

x(l — x) 



< 2 



and 



a n \e — e'\\x — e\ < 2x 2 (l — x) 2 p n al/ 2 (log a n ) 1//2 , 
we obtain that 

a n (x - e) 2 



2x 2 (l-x) 2 

_ a n (x - e') 2 
2x 2 (l-x) 2 



1+ {X ~ £ \ 



x(l — x) 

(x - gQ 
x(l — x) 



C{x) + Q 



ki 



X — £ 

x(l — x) 



1 + 



C(x) + Q kl 



X — £ 

x(l — x) 



< C[p 2 Tl + (\oga n ) l/2 p n + (log a n )p n a n 1/2 } 
and finally,



9a n , 



■0*0-1 



< C/ 3nV / IoI^ + 0(a^ fc2/2 £ fc2 (l -e) fc2 (loga n ) fc2 / 2 + a r 



&3 - 



Now let $|x - \epsilon| > M\epsilon(1-\epsilon)\sqrt{\log(\alpha_n)}/\sqrt{\alpha_n}$ and $x \in (\beta_n, 1-\beta_n)$; we use (3.4) together with the above calculations and the fact that the function $\epsilon \mapsto \epsilon\log(\epsilon)/(1-\epsilon)$ is bounded on $[0,1]$,

1 



9a n , 
9a n ,t 



■(x) = exp< —a 



l-e 

1, 

+ - log 



log - 



1 



7 lo § - 



l-e 



1 — x 



? log U 



l-e' 



exp< — a n (e — e 



l-e 



log(e) - — ^log(l - e) 



log(x)- 



log(l — x 



(l-e) 



(l-e) 

x (l + O^a" 1 + <*-**)), 
where $\bar\epsilon \in (\epsilon, \epsilon')$. Hence as soon as $1 - \delta_n > \epsilon, \epsilon' > \delta_n$ and $x \in (\beta_n, 1-\beta_n)$,



log(l — x 



(l-e) 



< llog^)^- 1 , 



log(x) 



(l-e) 



< llog(^)^- 1 , 



which implies that if $\alpha_n^{1/2}\rho_n|\log(\beta_n)|\delta_n^{-1}$ is small enough,



5q„, 



(ar)-l 



< Cay 2 / 9 n |log(/3 n )|(5- 1 + 0(p 



which completes the proof of Lemma B.1. □
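
The first part of Lemma B.1 can be illustrated numerically: inside the window $|x - \epsilon| < M\epsilon(1-\epsilon)\sqrt{\log\alpha}/\sqrt{\alpha}$, moving the mean by $\rho\,\epsilon(1-\epsilon)/\sqrt{\alpha}$ changes the density ratio by an amount of order $\rho\sqrt{\log\alpha}$. The sketch assumes that $g_{\alpha,\epsilon}$ is the Beta density with parameters $\alpha/(1-\epsilon)$ and $\alpha/\epsilon$; the numerical values of `alpha`, `eps`, `rho` and `M` are arbitrary illustrative choices.

```python
import math

def g(x, eps, alpha):
    # Beta density with parameters alpha/(1-eps) and alpha/eps (assumed parametrization).
    a, b = alpha / (1.0 - eps), alpha / eps
    return math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                    + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))

alpha, eps, rho, M = 400.0, 0.3, 1e-3, 1.0
eps_prime = eps + rho * eps * (1.0 - eps) / math.sqrt(alpha)
half_width = M * eps * (1.0 - eps) * math.sqrt(math.log(alpha) / alpha)

# Largest deviation of the density ratio from 1 over the window.
worst = 0.0
for i in range(-50, 51):
    x = eps + half_width * i / 50.0
    worst = max(worst, abs(g(x, eps, alpha) / g(x, eps_prime, alpha) - 1.0))
print(worst, rho * math.sqrt(math.log(alpha)))
```

The two printed numbers are of the same order, in line with the leading term of the bound in the lemma.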



APPENDIX C: LEMMA C.l 

The following lemma allows us to control the ratio of constants of Beta 
densities. 

Lemma C.1. Let $a, b > 0$ and $0 < \tau_1 < a$, $0 < \tau_2 < b$, let $C, \rho$ denote generic positive constants. Let $\eta = a + b$ and $\bar\tau = \tau_1 + \tau_2$. We then have the following results:
(ii) If $a < 2$, $b > 2$, then $\eta > 2$ and

. /r>-Ti)r(6-T 2 )\ . /r(r? + r)\ 2n _ n , _ ^ 
Vr(a + Ti)r(6 + r 2 )y vri 7 ?- 1 ")/ °- r i 

(iii) If $b < 2$, $a > 2$, then things are symmetrical to the previous case.

(iv) If $a, b > 2$, $i = 1, 2$, then

Proof. The proof of Lemma C.1 comes from Taylor expansions of $\log(\Gamma(x))$ and from the use of the relation

$$\psi(x) = -\frac{1}{x} + \psi(x+1),$$

so that when $x$ is small, $\psi(x)$ is bounded by $1/x$ plus a constant, and if $x$ is large, $\psi(x)$ is bounded by $\log(x)$ plus a constant. □
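
The two regimes of the digamma function used in this proof can be seen in a small Python sketch. It implements $\psi$ by the recurrence above followed by the standard asymptotic series for large arguments; the function name `digamma`, the shift threshold 6 and the number of series terms are implementation choices made here, not taken from the paper.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift x upward with psi(x) = -1/x + psi(x + 1),
    then apply the standard asymptotic expansion for large arguments."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # log(x) - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    tail = math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return shift + tail

for x in (1e-3, 1e-1, 1.0, 1e1, 1e3):
    # Small x: psi(x) + 1/x stays bounded. Large x: psi(x) - log(x) stays bounded.
    print(x, digamma(x) + 1.0 / x, digamma(x) - math.log(x))
```

For small $x$ the first printed difference remains bounded, and for large $x$ the second one does, which is the content of the two bounds invoked in the proof.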